Read Data

A new Spark session contains no data. The first step is either to load data into your Spark session’s memory, or to point Spark to the location of the data so it can access it on demand.

Exercise

For this exercise, we will start a “local” Spark session, and then transfer data from our R environment to the Spark session’s memory. To do that, we will use the copy_to() command:

library(sparklyr)

sc <- spark_connect(master = "local")

tbl_mtcars <- copy_to(sc, mtcars, "spark_mtcars")

If you are using the RStudio IDE, you will notice a new table in the Connections pane. The name of that table is spark_mtcars; that is the name of the data set inside Spark’s memory. The tbl_mtcars variable does not contain any of the mtcars data itself; it holds a reference that points to where the Spark session loaded the data.

Calling the tbl_mtcars variable in R will download the first 1,000 records from Spark and display them:

tbl_mtcars
#> # Source: spark<spark_mtcars> [?? x 11]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1
#>  2  21       6  160    110  3.9   2.88  17.0     0     1
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0
#> # … with more rows, and 2 more variables: gear <dbl>,
#> #   carb <dbl>

Notice that the header of the printout, Source: spark<spark_mtcars>, indicates that the records were downloaded from Spark.
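
Because tbl_mtcars is only a reference, data manipulation commands run against it are executed by Spark, not by R. As a quick illustration (a sketch that assumes the dplyr package is loaded), the following aggregation is translated to SQL, computed inside Spark, and only the small summarized result is downloaded:

library(dplyr)

# Runs inside Spark; only the per-cylinder averages return to R
tbl_mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))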

To clean up, we will now stop the Spark session:

spark_disconnect(sc)

Working with Files

In a formal Spark environment, it will rarely be necessary to upload data from R into Spark.

Using sparklyr, you can tell Spark to read and write data. Spark can interact with multiple types of file systems, such as HDFS, S3, and the local file system. It can also read and write several file formats, such as CSV, Parquet, Delta, and JSON. sparklyr provides functions that make it easy to access these features. See the Spark Data section for a full list of available functions.
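
The read and write functions follow a consistent spark_read_*() and spark_write_*() naming pattern. As a minimal sketch of that pattern (do not run; the table names and paths below are hypothetical placeholders):

# Do not run. The names and paths are placeholders for illustration only.
tbl_flights <- spark_read_parquet(sc, name = "flights", path = "/data/flights")
tbl_events  <- spark_read_json(sc, name = "events", path = "/data/events.json")
spark_write_csv(tbl_flights, path = "/output/flights_csv")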

The following command tells Spark to read a CSV file, and also to load it into Spark’s memory.

# Do not run the following command. It is for example purposes only.
spark_read_csv(sc, name = "test_table", path = "/test/path/test_file.csv")
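
To instead point Spark to the file so it can be accessed on demand, without loading it into Spark memory (the second approach described at the start of this section), the read functions accept a memory argument. A minimal sketch using the same placeholder path:

# Do not run. Maps the file for on-demand access instead of caching it.
spark_read_csv(
  sc,
  name = "test_table",
  path = "/test/path/test_file.csv",
  memory = FALSE
)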