What do you think about this?

The RStudio team would like to share with you some test deployment environments you can use to start your Spark journey.

Disclaimer: Please note that these articles are meant as guides only. RStudio is not responsible for any issues or charges incurred from their use.

YARN Client

Amazon’s EMR

This example demonstrates a complete workflow using Hadoop and Hive with Amazon Elastic MapReduce (EMR). We access our data with a Spark cluster, understand our data using sparklyr, and then communicate our insights via a flexdashboard.

Cloudera Express

This example demonstrates a complete workflow using Hadoop and Hive with Cloudera (CDH). In addition to the workflow, we show these useful web tools: Cloudera Manager, HUE, and the Spark UI.
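
Both YARN client examples above start from a sparklyr connection made on a cluster node. The following is a minimal sketch of that step, not the articles' exact code. It assumes sparklyr is installed and that SPARK_HOME and the Hadoop configuration are already set on the node. The executor settings are illustrative values, and `connect_yarn` is a hypothetical helper name.

```r
library(sparklyr)

# Start from sparklyr's default configuration and override a few Spark
# properties. The values below are illustrative, not recommendations.
config <- spark_config()
config$spark.executor.memory <- "2G"
config$spark.executor.cores <- 2

# Hypothetical helper: nothing connects until it is called, so the
# settings can be reviewed first.
connect_yarn <- function(conf = config) {
  spark_connect(master = "yarn-client", config = conf)
}

# On a cluster node:
# sc <- connect_yarn()
# dplyr::src_tbls(sc)    # list the tables visible through the connection
# spark_disconnect(sc)
```

Wrapping the call in a function keeps the configuration separate from the connection itself, which makes it easier to reuse the same settings on EMR and CDH.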

Stand Alone

Amazon’s EC2

You can create a Spark cluster without Hadoop by using Spark standalone mode. This example shows you how to set up a standalone cluster in Amazon EC2.
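
Once a standalone cluster is running, connecting from R could look roughly like the sketch below. The host name `master-host` and the helper `connect_standalone` are hypothetical. The sketch assumes the master listens on Spark's default standalone port, 7077, and that SPARK_HOME points at a Spark installation matching the cluster version.

```r
library(sparklyr)

# Hypothetical master URL; replace master-host with the address of the
# EC2 instance that runs the Spark master.
master_url <- "spark://master-host:7077"

# Hypothetical helper that defers the connection until it is called.
connect_standalone <- function(url = master_url,
                               home = Sys.getenv("SPARK_HOME")) {
  spark_connect(master = url, spark_home = home)
}

# sc <- connect_standalone()
# spark_disconnect(sc)
```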

Working with S3 data

Pairing Spark with S3 is becoming an increasingly popular approach. Because it separates the data from the computation, it lets us tear down the Spark cluster when we are done with the analysis without losing the source data. We ran some experiments to arrive at a recommendation that may work for those who are using, or thinking about using, this approach for their analyses.

Performance and Tuning

Understanding Spark Caching

Using a reproducible example, we review some of the main configuration settings, commands, and command arguments that can help you get the most out of Spark's memory management options.
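
As a rough illustration of the kind of commands involved, the sketch below maps a CSV file into Spark without caching it and then caches it explicitly. It is an assumption-laden example rather than the article's own code: `read_then_cache`, the table name `"flights"`, and the `path` argument are hypothetical, while `spark_read_csv()`, `tbl_cache()`, and `tbl_uncache()` are sparklyr functions.

```r
library(sparklyr)

# Hypothetical helper: sc is an open sparklyr connection and path points
# to a CSV file that Spark can read.
read_then_cache <- function(sc, path, name = "flights") {
  # memory = FALSE registers the table without loading it into memory.
  tbl <- spark_read_csv(sc, name = name, path = path, memory = FALSE)

  # Explicitly cache the registered table in Spark memory.
  tbl_cache(sc, name)

  tbl
}

# sc <- spark_connect(master = "local")
# flights <- read_then_cache(sc, "flights.csv")
# tbl_uncache(sc, "flights")   # release the memory when finished
# spark_disconnect(sc)
```

Reading with `memory = FALSE` and caching later makes the caching step explicit, so it is possible to filter or aggregate first and cache only the subset that will be queried repeatedly.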

sparklyr is an RStudio project. © 2016 RStudio, Inc.