Using Apache Arrow
Apache Arrow is a cross-language development platform for in-memory data. Arrow has been supported since sparklyr 1.0.0 to improve performance when transferring data between Spark and R. You can find some performance benchmarks online.
Using Arrow from R requires installing:
- The Arrow Runtime: Provides a cross-language runtime library.
- The Arrow R Package: Provides support for using Arrow from R through an R package.
Installing on OS X requires Homebrew; from a terminal, execute:
brew install apache-arrow
Currently, installing Arrow on Windows requires Conda; from a terminal, execute:

conda install arrow-cpp=0.12.* -c conda-forge
conda install pyarrow=0.12.* -c conda-forge
Please reference arrow.apache.org/install when installing Arrow for Linux.
As of this writing, the arrow R package is not yet available on CRAN; however, it can be installed using the remotes package. First, install remotes:

install.packages("remotes")

Then install the arrow R package from GitHub as follows:
remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.0")
If you happen to have Arrow 0.11 installed, you will have to install the R package from a matching commit instead:
remotes::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
There are three main use cases for Arrow in sparklyr:
- Data Copying: When copying data with copy_to(), Arrow will be used.
- Data Collection: When collecting data, either implicitly by printing datasets or explicitly by calling collect(), Arrow will also be used.
- R Transformations: When using spark_apply(), data will be transferred using Arrow when possible.
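The three use cases above can be sketched as follows; this is a minimal illustration, assuming a local Spark installation and the arrow package installed:

```r
library(arrow)     # loading arrow enables it for sparklyr data transfers
library(sparklyr)

sc <- spark_connect(master = "local")

# Data copying: copy_to() transfers the data frame to Spark using Arrow
cars <- copy_to(sc, mtcars)

# Data collection: collect() brings results back to R through Arrow
collected <- dplyr::collect(cars)

# R transformations: spark_apply() streams partitions through Arrow when possible
doubled <- spark_apply(cars, function(df) df$hp * 2)

spark_disconnect(sc)
```
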
To use Arrow with sparklyr, one simply needs to import this library:

library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

The following objects are masked from ‘package:base’:

    array, table
Some data types are mapped to slightly different, one can argue more correct, types when using Arrow. For instance, consider collecting 64-bit integers from Spark:
library(sparklyr)

sc <- spark_connect(master = "local")
integer64 <- sdf_len(sc, 2, type = "integer64")
integer64
# Source: spark<?> [?? x 1]
     id
  <dbl>
1     1
2     2
sparklyr collects 64-bit integers as double; however, with the arrow package loaded:
# Source: spark<?> [?? x 1]
  id
  <S3: integer64>
1 1
2 2
The 64-bit integers are now collected as proper 64-bit integers using the integer64 type.
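One can confirm the collected type explicitly; the following sketch assumes the `sc` connection and `integer64` dataset from the example above, with the arrow and bit64 packages installed (bit64 provides the integer64 class in R):

```r
library(bit64)    # provides the integer64 class used by Arrow for 64-bit integers

collected <- dplyr::collect(integer64)
class(collected$id)   # prints "integer64" when Arrow handled the transfer
```
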
The Arrow R package supports many data types; however, in cases where a type is unsupported, sparklyr will fall back to not using Arrow and print a warning:
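Arrow can also be turned off for an entire session; this is a sketch under the assumption that sparklyr honors the sparklyr.arrow option, which forces the non-Arrow serialization path even when the arrow package is loaded:

```r
# Assumption: sparklyr consults the sparklyr.arrow option before using Arrow;
# setting it to FALSE disables Arrow-based transfers for the session.
options(sparklyr.arrow = FALSE)
```
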
library(sparklyr.nested)
library(sparklyr)
library(dplyr)
library(arrow)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars)

sdf_nest(cars, hp) %>%
  group_by(cyl) %>%
  summarize(data = collect_list(data))
# Source: spark<?> [?? x 2]
    cyl data
  <dbl> <list>
1     6 <list>
2     4 <list>
3     8 <list>
Warning message:
In arrow_enabled_object.spark_jobj(sdf) :
  Arrow disabled due to columns: data