Option 1 - Connecting to Databricks remotely
With this configuration, RStudio Server Pro is installed outside of the Spark
cluster and allows users to connect to Spark remotely using
This is the recommended configuration because it targets separate environments, involves a typical configuration process, avoids resource contention, and allows RStudio Server Pro to connect to Databricks as well as other remote storage and compute resources.
Advantages and limitations
- RStudio Server Pro will remain functional if Databricks clusters are terminated
- Provides the ability to communicate with one or more Databricks clusters as a remote compute resource
- Avoids resource contention between RStudio Server Pro and Databricks
- Databricks Connect does not currently support the following APIs from
sparklyr: Broom APIs, Streaming APIs, Broadcast APIs, Most MLlib APIs,
csv_fileserialization mode, and the
- Databricks Connect does not support structured streaming
- Databricks Connect does not support running arbitrary code that is not a part of a Spark job on the remote cluster
- Databricks Connect does not support Scala, Python, and R APIs for Delta table operations
- Databricks Connect does not support most utilities in Databricks Utilities.
For more information on the limitations of Databricks Connect, refer to the Limitation section of the Databricks Connect documentation.
- RStudio Server Pro installed outside of the Databricks cluster
- Java 8 installed on the machine with RStudio Server Pro
- A running Databricks cluster with a runtime version 5.5 or above
The Databricks Connect client is provided as a Python library. The minor version of your Python installation must be the same as the minor Python version of your Databricks cluster.
Refer to the steps in the install Python section of the RStudio Documentation to install Python on the same server where RStudio Server Pro is installed.
Note that you can either install Python for all users in a global location (as an administrator) or in a home directory (as an end user).
Install Databricks Connect
Run the following command to install Databricks Connect on the server with RStudio Server Pro:
pip install -U databricks-connect==6.3.* # or a different version to match your Databricks cluster
Note that you can either install this library for all users in a global Python
environment (as an administrator) or for an individual user in their Python
environment (e.g., using the
pip --user option or installing into a conda
environment or virtual environment).
Configure Databricks Connect
To configure the Databricks Connect client, you can run the following command in a terminal when logged in as a user in RStudio Server Pro:
In the prompts that follow, enter the following information:
|Databricks Host||Base address of your Databricks console URL||
|Databricks Token||User token generated from the Databricks Console under your “User Settings”||
|Cluster ID||Cluster ID in the Databricks console under Advanced Options > Tags >
|Org ID||Found in the
|Port||The port that Databricks Connect connects to||
After you’ve completed the configuration process for Databricks Connect, you can run the following command in a terminal to test the connectivity of Databricks Connect to your Databricks cluster:
The integration of
sparklyr with Databricks Connect is currently being added
to the development version of
sparklyr. To use this functionality now, you’ll
need to install the development version of
sparklyr by running the following
command in an R console:
To work with a remote Databricks cluster, you need to have a local installation of Spark that matches the version of Spark on the Databricks Cluster.
You can install Spark by running the following command in an R console:
You can specify the version of Spark to install along with other options. Refer
spark_install() options in the
sparklyr reference documentation
for more information.
In order to connect to Databricks using
SPARK_HOME must be set to the output of the
databricks-connect get-spark-home command.
You can set
SPARK_HOME as an environment variable or directly within
spark_connect(). The following R code demonstrates connecting to Databricks,
copying some data into the cluster, summarizing that data using
library(sparklyr) library(dplyr) databricks_connect_spark_home <- system("databricks-connect get-spark-home", intern = TRUE) sc <- spark_connect(method = "databricks", spark_home = databricks_connect_spark_home) cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE) cars_tbl %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg, na.rm = TRUE), mean_hp = mean(hp, na.rm = TRUE)) spark_disconnect(sc)
For more information on the setup, configuration, troubleshooting, and limitations of Databricks Connect, refer to the Databricks Connect section of the Databricks documentation.