Examples

Option 1 - Connecting to Databricks remotely
Overview
With this configuration, RStudio Server Pro is installed outside of the Spark cluster and allows users to connect to Spark remotely using sparklyr with Databricks Connect. This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources.
Advantages and limitations
Advantages:
RStudio Server Pro will remain functional if Databricks clusters are terminated
Provides the ability to communicate with one or more Databricks clusters as a remote compute resource
Avoids resource contention between RStudio Server Pro and Databricks
Limitations:

Option 2 - Working inside of Databricks
Overview
If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Server Pro directly within a Databricks cluster as described in the sections below. With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to work locally with Spark using sparklyr. This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.

Spark Standalone Deployment in AWS
Overview
The plan is to launch 4 identical EC2 server instances. One server will be the Master node and the other 3 will be the worker nodes. On one of the worker nodes, we will install RStudio Server. The only thing that makes a server the Master node is that it runs the master service, while the other machines run the slave service and are pointed to that first master.

Using RStudio Server Pro inside of Qubole
Overview
Qubole users can request access to RStudio Server Pro. This allows users to use sparklyr to interact directly with Spark from within the Qubole cluster.
Advantages and limitations
Advantages:
Ability for users to connect sparklyr directly to Spark within Qubole
Provides a high-bandwidth connection between R and the Spark JVM processes because they are running on the same machine
Can load data from the cluster directly into an R session since RStudio Server Pro is installed within the Qubole cluster
A unique, persistent home directory for each user
Limitations:

Using sparklyr with an Apache Spark cluster
Summary
This document demonstrates how to use sparklyr with a Cloudera Hadoop & Spark cluster. Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes. RStudio Server is installed on the master node and orchestrates the analysis in Spark.
Cloudera Cluster
This demonstration is focused on adding RStudio integration to an existing Cloudera cluster. It assumes that no help is needed to set up and administer the cluster.

Using sparklyr with an Apache Spark cluster
This document demonstrates how to use sparklyr with an Apache Spark cluster. Data are downloaded from the web and stored in Hive tables on HDFS across multiple worker nodes. RStudio Server is installed on the master node and orchestrates the analysis in Spark. Here is the basic workflow.
Data preparation
Set up the cluster
This demonstration uses Amazon Web Services (AWS), but it could just as easily use Microsoft, Google, or any other provider.

Using sparklyr with Databricks
Overview
This documentation demonstrates how to use sparklyr with Apache Spark in Databricks along with RStudio Team, RStudio Server Pro, RStudio Connect, and RStudio Package Manager.
Using RStudio Team with Databricks
RStudio Team is a bundle of our popular professional software for developing data science projects, publishing data products, and managing packages. RStudio Team and sparklyr can be used with Databricks to work with large datasets and distributed computations with Apache Spark.

Using sparklyr with Qubole
Overview
This documentation demonstrates how to use sparklyr with Apache Spark in Qubole along with RStudio Server Pro and RStudio Connect.
Best practices for working with Qubole
Manage packages via Qubole Environments - Packages installed via install.packages() are not available on cluster restart. Packages managed through Qubole Environments are persistent.
Restrict workloads to interactive analysis - Only perform workloads related to exploratory or interactive analysis with Spark, then write the results to a database, file system, or cloud storage for more efficient retrieval in apps, reports, and APIs.