Spark Standalone Deployment in AWS

Overview

The plan is to launch 4 identical EC2 server instances. One server will be the Master node and the other 3 will be the worker nodes. On one of the worker nodes, we will install RStudio Server.

What makes a server the Master node is only the fact that it is running the master service, while the other machines are running the slave service and are pointed to that first master. This simple setup allows us to install the same Spark components on all 4 servers and then just add RStudio to one of them.

The topology will look something like this:

AWS EC2 Instances

Here are the details of the EC2 instance. Deploy just one at this point:

  • Type: t2.medium
  • OS: Ubuntu 16.04 LTS
  • Disk space: At least 20GB
  • Security group: Open the following ports: 8080 (Spark Master UI), 4040 (Spark application UI), 8088 (sparklyr UI) and 8787 (RStudio). Also open all TCP ports for the machines inside the security group.
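
For readers who prefer the AWS CLI to the console, the port rules above could be sketched as follows. This is only a dry run that prints the commands instead of executing them. The security group ID and the admin CIDR range are hypothetical placeholders, not values from this guide, and should be replaced with your own.

```shell
# Dry-run sketch: prints the AWS CLI calls instead of running them.
# SG_ID and ADMIN_CIDR are hypothetical placeholders (assumptions).
SG_ID="sg-0123456789abcdef0"
ADMIN_CIDR="203.0.113.0/24"

open_port_cmd() {
  # $1: TCP port to open for ADMIN_CIDR
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $1 --cidr $ADMIN_CIDR"
}

# The four ports listed in this guide.
for port in 8080 4040 8088 8787; do
  open_port_cmd "$port"
done

# All TCP traffic between machines that share the security group.
echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 0-65535 --source-group $SG_ID"
```

Restricting the UI ports to a known address range, rather than to every address, is a design choice of this sketch and not something the guide requires.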

Spark

Perform the steps in this section on all of the servers that will be part of the cluster.

Install Java 8

  • We will add the Java 8 repository, install Java 8 and set it as the default
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
sudo apt-get update

Alternatively, run

sudo apt install openjdk-8-jdk

to install OpenJDK version 8.
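
Whichever route is used, it may be worth confirming that Java 8 is the version found on the PATH before moving on. The helper below is a minimal sketch. The function name java_major is hypothetical, and it assumes the usual version string formats (1.8.0_x for Java 8, 11.0.x and similar for later releases).

```shell
java_major() {
  # $1: a version string such as 1.8.0_292 or 11.0.2
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme: 1.8.x means Java 8
    *)   echo "$1" | cut -d. -f1 ;;  # newer scheme: 11.0.x means Java 11
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted field.
  version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Java major version: $(java_major "$version")"
else
  echo "java not found on PATH"
fi
```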

Download Spark

wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
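
Because the same image is later cloned to every node, a quick check that the archive extracted correctly can save some debugging. The sketch below only looks for the two sbin scripts used later in this guide. The function name check_spark_home is hypothetical, and the example path assumes the archive was extracted in the home directory.

```shell
check_spark_home() {
  # $1: path to the extracted Spark directory
  for script in start-master.sh start-slave.sh; do
    if [ ! -f "$1/sbin/$script" ]; then
      echo "missing: $1/sbin/$script"
      return 1
    fi
  done
  echo "ok: $1"
}

# Example call; the location is an assumption, adjust it to your setup.
check_spark_home "$HOME/spark-2.1.0-bin-hadoop2.7" || true
```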

Create and launch AMI

  • We will create an image of the server. In Amazon, these are called AMIs. For more information, please see the User Guide.

  • Launch 3 instances of the AMI
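
The two steps above can also be done from the AWS CLI. The following is a dry-run sketch that prints the commands rather than running them. The instance ID, image name and AMI ID are hypothetical placeholders. In practice the AMI ID is the one returned when the image is created.

```shell
# Dry-run sketch: prints the commands rather than calling AWS.
# All three values below are hypothetical placeholders (assumptions).
INSTANCE_ID="i-0123456789abcdef0"
AMI_NAME="spark-2.1.0-node"
AMI_ID="ami-0123456789abcdef0"

create_image_cmd() {
  echo "aws ec2 create-image --instance-id $INSTANCE_ID --name $AMI_NAME"
}

launch_instances_cmd() {
  # $1: number of instances to launch from the AMI
  echo "aws ec2 run-instances --image-id $AMI_ID --count $1 --instance-type t2.medium"
}

create_image_cmd
launch_instances_cmd 3
```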

RStudio Server

Select one of the nodes to execute this section. Please check the RStudio download page for the latest version.

Install R

  • In order to get the latest R core, we will need to update the source list in Ubuntu.
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update
  • Now we can install R
sudo apt-get install r-base
sudo apt-get install gdebi-core

Install RStudio

  • We will download and install version 1.0.153 of RStudio Server. To find the latest version, please visit the RStudio website. In order to get the enhanced integration with Spark, RStudio version 1.0.44 or later will be needed.
wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
sudo gdebi rstudio-server-1.0.153-amd64.deb

Install dependencies

  • Run the following commands
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
sudo apt-get -y install libxml2-dev

Add default user

  • Run the following command to add a default user
sudo adduser rstudio-user

Start the Master node

  • Select one of the servers to become your Master node

  • Run the command that starts the master service

sudo spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh
  • Close the terminal connection (optional)

Start Worker nodes

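  • Start the “slave” service on each worker node, pointing it at the Master node. Important: when you use the Master node’s IP address, separate the numbers with dots (172.30.1.94), not with the dashes that appear in the EC2 hostname (ip-172-30-1-94)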
sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://[Master node's IP address]:7077

For example, using the Master node's internal hostname instead of its IP address:

sudo spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-30-1-94.us-west-2.compute.internal:7077

  • Close the terminal connection (optional)
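
To avoid mixing up dots and dashes when typing the Master address by hand, the URL can be built by a small helper. This is a sketch under the assumption that EC2 internal hostnames follow the ip-A-B-C-D.<domain> pattern seen above. The function name master_url is hypothetical. Any other input is passed through unchanged.

```shell
master_url() {
  # $1: Master node IP address or EC2 internal hostname
  # $2: optional port, defaulting to 7077 as used in this guide
  host=$1
  port=${2:-7077}
  # Turn ip-172-30-1-94.<domain> into 172.30.1.94; leave other input untouched.
  ip=$(printf '%s\n' "$host" | sed -n 's/^ip-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)\..*$/\1.\2.\3.\4/p')
  if [ -n "$ip" ]; then
    host=$ip
  fi
  printf 'spark://%s:%s\n' "$host" "$port"
}

master_url ip-172-30-1-94.us-west-2.compute.internal
master_url 172.30.1.94
```

The result can then be passed as the argument to start-slave.sh on each worker.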

Pre-load packages

  • Log into RStudio (port 8787)

  • Log in as ‘rstudio-user’ and install the sparklyr package

install.packages("sparklyr")

Connect to the Spark Master

  • Navigate to the Spark Master’s UI, typically on port 8080

  • Note the Spark Master URL

  • Log on to RStudio

  • Run the following code

library(sparklyr)

conf <- spark_config()
conf$spark.executor.memory <- "2GB"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "[Spark Master URL]",
                    version = "2.1.0",
                    config = conf,
                    spark_home = "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/")
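
A common slip at this point is to leave the placeholder in place or to paste the web UI address (port 8080) instead of the Master URL. The sketch below checks the shape of the value from a terminal before it is used in spark_connect. The function name check_master_url is hypothetical, and the pattern assumes the spark://host:port form used earlier in this guide.

```shell
check_master_url() {
  # $1: candidate Master URL copied from the Spark Master UI
  if printf '%s\n' "$1" | grep -Eq '^spark://[A-Za-z0-9.-]+:[0-9]+$'; then
    echo "valid"
  else
    echo "invalid"
  fi
}

check_master_url "spark://ip-172-30-1-94.us-west-2.compute.internal:7077"
check_master_url "[Spark Master URL]"
```

The check only validates the format. It does not confirm that the Master is reachable from the RStudio node.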