The RStudio team would like to share with you some test deployment environments you can use to start your Spark journey.

Disclaimer: Please note that these articles are meant as guides only. RStudio is not responsible for any issues or charges incurred by following them.

YARN Client

Data Science using a Data Lake

We have noticed that the questions we field after demonstrating sparklyr to our customers are more about high-level architecture than about how the package works. To answer those questions, we put together a set of slides that illustrate and discuss the important concepts and help customers see where Spark, R, and sparklyr fit in a Big Data Platform implementation. In this article, we’ll review those slides and provide a narrative that will help you better envision how you can take advantage of our products.

Amazon’s EMR

This example demonstrates a complete workflow using Hadoop and Hive with Amazon Elastic MapReduce (EMR). We access our data with a Spark cluster, understand our data using sparklyr, and then communicate our insights via a flexdashboard.

Cloudera Express

This example demonstrates a complete workflow using Hadoop and Hive with Cloudera (CDH). In addition to the workflow, we show these useful web tools: Cloudera Manager, HUE, and the Spark UI.

Standalone

Amazon’s EC2

You can create a Spark cluster without Hadoop by using Spark standalone mode. This example shows you how to set up a standalone cluster in Amazon EC2.

Working with S3 data

Pairing Spark with S3 is an increasingly popular approach. Because it separates the data from the computation, it lets us tear down the Spark cluster when we are done with the analysis without losing the source data. We ran some experiments to arrive at a recommendation for those who are using, or thinking about using, this approach for their analyses.

Performance and Tuning

Understanding Spark Caching

Using a reproducible example, we review some of the main configuration settings, commands, and command arguments that can help you get the best out of Spark's memory management options.

sparklyr is an RStudio project. © 2016 RStudio, Inc.