Using sparklyr with Qubole

This documentation demonstrates how to use sparklyr with Apache Spark on Qubole, along with RStudio Server Pro and RStudio Connect.

Best practices for working with Qubole

  • Manage packages via Qubole Environments - Packages installed with install.packages() do not persist across cluster restarts; packages managed through Qubole Environments do.
  • Restrict Spark workloads to interactive analysis - Use Spark only for exploratory or interactive analysis, then write the results to a database, file system, or cloud storage where apps, reports, and APIs can retrieve them more efficiently.
  • Load and query results efficiently - Because of the overhead of Spark computations, Shiny apps that query Spark directly on the backend tend to have performance and runtime issues; read precomputed results from a database, file system, or cloud storage instead.
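The pattern the bullets above describe can be sketched as follows. This is a minimal, hedged example: the YARN master value, the nycflights13 sample data, and the S3 bucket path are assumptions, not values prescribed by Qubole.

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster's Spark; on a Qubole cluster the master is
# typically YARN (this value is an assumption).
sc <- spark_connect(master = "yarn-client")

# Interactive/exploratory analysis happens in Spark...
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

delay_summary <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

# ...but the results are written to cloud storage so that apps,
# reports, and APIs can read them without touching Spark
# (the bucket path is illustrative).
spark_write_csv(delay_summary,
                "s3://my-bucket/summaries/delay_summary",
                mode = "overwrite")

spark_disconnect(sc)
```

A Shiny app or report would then read the CSV output directly rather than holding an open Spark connection.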

Using RStudio Server Pro with Qubole

The Qubole platform includes RStudio Server Pro. More details about how to request RStudio Server Pro and access it from within a Qubole cluster are available from Qubole.

View steps for running RStudio Server Pro inside Qubole
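Once RStudio Server Pro is running inside the cluster, sparklyr can attach to the cluster's own Spark installation. A minimal sketch, assuming a YARN-managed Qubole cluster where SPARK_HOME is set in the environment (the master value and the executor-memory setting are illustrative assumptions):

```r
library(sparklyr)

# Optional tuning via spark_config(); the memory value is illustrative.
conf <- spark_config()
conf$spark.executor.memory <- "2g"

# Connect from the cluster's coordinator node to Spark on YARN.
sc <- spark_connect(
  master     = "yarn-client",
  spark_home = Sys.getenv("SPARK_HOME"),
  config     = conf
)

# Confirm the connection by checking the Spark version.
spark_version(sc)

spark_disconnect(sc)
```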

Using RStudio Connect with Qubole

The best configuration for working with Qubole and RStudio Connect is to install RStudio Connect outside of the Qubole cluster and connect to Qubole remotely. This is accomplished using the Qubole ODBC Driver.
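A remote connection from the RStudio Connect host can be sketched with DBI and the odbc package. This assumes the Qubole ODBC Driver is installed on that host and registered under a DSN; the DSN name and the table queried are assumptions for illustration:

```r
library(DBI)

# Connect through the Qubole ODBC Driver via a configured DSN
# (the name "Qubole" is an assumption; use your own DSN).
con <- dbConnect(odbc::odbc(), dsn = "Qubole")

# Query precomputed results rather than running Spark jobs from
# Connect (the table name is illustrative).
results <- dbGetQuery(con, "SELECT * FROM delay_summary LIMIT 100")

dbDisconnect(con)
```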