Option 2 - Working inside of Databricks
If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Server Pro directly within a Databricks cluster as described in the sections below.
With this configuration, RStudio Server Pro is installed on the Spark driver node and allows users to work locally with Spark using sparklyr.
This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Server Pro and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.
For additional details, refer to the FAQ for RStudio in the Databricks Documentation.
Advantages and limitations
- Ability for users to connect sparklyr to Spark without configuring remote connectivity
- Provides a high-bandwidth connection between R and the Spark JVM processes because they are running on the same machine
- Can load data from the cluster directly into an R session since RStudio Server Pro is installed within the Databricks cluster
- If the Databricks cluster is restarted or terminated, then the instance of RStudio Server Pro will be terminated and its configuration will be lost
- If users do not persist their code through version control or the Databricks File System, then their work can be lost if the cluster is restarted or terminated
- RStudio Server Pro (and other RStudio products) installed within a Databricks cluster will be limited to the compute resources and lifecycle of that particular Spark cluster
- Non-Spark jobs will use CPU and RAM resources within the Databricks cluster
- Need to install one instance of RStudio Server Pro per Spark cluster that you want to run jobs on
Requirements
- A running Databricks cluster with a runtime version 4.1 or above
- The cluster must not have “table access control” or “automatic termination” enabled
- You must have “Can Attach To” permission for the Databricks cluster
The following steps walk through the process to install RStudio Server Pro on the Spark driver node within your Databricks cluster.
The recommended method for installing RStudio Server Pro to the Spark driver node is via SSH. However, an alternative method is available if you are not able to access the Spark driver node via SSH.
Configure SSH access to the Spark driver node
Configure SSH access to the Spark driver node in Databricks by following the steps in the SSH access to clusters section of the Databricks Cluster configurations documentation.
Note: If you are unable to configure SSH access or connect to the Spark driver node via SSH, then you can follow the steps in the Get started with RStudio Server Pro section of the RStudio on Databricks documentation to install RStudio Server Pro from a Databricks notebook, then skip to the access RStudio Server Pro section of this documentation.
Connect to the Spark driver node via SSH
Connect to the Spark driver node via SSH on port 2200 by using the following command on your local machine:
ssh ubuntu@<spark-driver-node-address> -p 2200 -i <path-to-private-SSH-key>
Replace <spark-driver-node-address> with the DNS name or IP address of the Spark driver node, and <path-to-private-SSH-key> with the path to your private SSH key on your local machine.
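If you connect to the driver node often, the command above can be wrapped in a small shell function. The following is only a sketch: the function name driver_ssh_command and the sample address and key path in the usage comment are hypothetical and not part of the RStudio or Databricks tooling. It prints the command so that you can review it before running it.

```shell
# Hypothetical helper: print the ssh command for a given Spark driver node.
#   $1 = DNS name or IP address of the Spark driver node
#   $2 = path to your private SSH key on your local machine
driver_ssh_command() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: driver_ssh_command <spark-driver-node-address> <path-to-private-SSH-key>" >&2
    return 1
  fi
  # User ubuntu and port 2200 are taken from the command shown above.
  printf 'ssh ubuntu@%s -p 2200 -i %s\n' "$1" "$2"
}

# Example with placeholder values (replace with your own):
# driver_ssh_command 10.0.0.5 ~/.ssh/databricks_key
```

Printing rather than executing keeps the helper free of side effects. Paste the printed command into your terminal to open the session.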
Install RStudio Server Pro on the Spark driver node
After you SSH into the Spark driver node, follow the typical steps to install RStudio Server Pro in the RStudio documentation. In the installation steps, select Ubuntu as the target Linux distribution.
Configure RStudio Server Pro
The following configuration steps are required to be able to use RStudio Server Pro with Databricks.
Add the following configuration lines to /etc/rstudio/rserver.conf to use proxied authentication with Databricks and enable the administrator dashboard:
auth-proxy=1
auth-proxy-user-header-rewrite=^(.*)$ $1
auth-proxy-sign-in-url=<domain>/login.html
admin-enabled=1
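Because Databricks clusters are ephemeral, you may end up applying these settings more than once. One possible approach is a small shell helper that appends a line only when it is not already in the file, so that repeated runs do not create duplicates. This is a sketch under stated assumptions: the function name append_line_if_missing is hypothetical, and it assumes the file either does not exist yet or ends with a newline.

```shell
# Hypothetical helper: append an exact line to a config file only if it is missing.
#   $1 = path to the config file
#   $2 = exact line to add
append_line_if_missing() {
  touch "$1"  # create the file if it does not exist yet
  # -F: treat the line as a fixed string, -x: match whole lines only,
  # -e: keep lines that start with "-" from being read as options
  if ! grep -qxF -e "$2" "$1"; then
    printf '%s\n' "$2" >> "$1"
  fi
}

# Example usage on the Spark driver node (run as root, because /etc/rstudio
# is normally writable only by root):
# append_line_if_missing /etc/rstudio/rserver.conf 'auth-proxy=1'
# append_line_if_missing /etc/rstudio/rserver.conf 'auth-proxy-user-header-rewrite=^(.*)$ $1'
# append_line_if_missing /etc/rstudio/rserver.conf 'admin-enabled=1'
```

Single quotes around each line stop the shell from expanding the $ characters in the header rewrite rule.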
Add the following configuration line to /etc/rstudio/rsession-profile to set PATH to be used with RStudio Server Pro:
Add the following configuration lines to
configure sessions in RStudio Server Pro to work with Databricks:
Restart RStudio Server Pro:
sudo rstudio-server restart
Access RStudio Server Pro
From the Databricks console, click on the Databricks cluster that you want to work with:
From within the Databricks cluster, click on the
Click on the Set up RStudio button:
To access RStudio Server Pro, click on the link to
If you configured proxied authentication in RStudio Server Pro as described in the previous section, then you do not need to use the username or password that is displayed. Instead, RStudio Server Pro will automatically log in and start a new RStudio session as your logged-in Databricks user:
Other users can access RStudio Server Pro from the Databricks console by following the same steps described above. You do not need to create those users or their home directories in RStudio Server Pro beforehand.
Use the following R code to establish a connection from sparklyr to the Databricks cluster:
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")