Spark’s ML Pipelines provide a way to easily combine multiple transformations and algorithms into a single workflow, or pipeline.
For R users, the insights gathered during interactive sessions with Spark can now be converted into a formal pipeline. This makes the hand-off from Data Scientists to Big Data Engineers much easier, because no additional changes should be needed from the latter group.
The final list of selected variables, data manipulations, feature transformations and modeling steps can easily be re-written into an ml_pipeline() object, saved, and ultimately placed into a Production environment. The sparklyr output of a saved Spark ML Pipeline object is Scala code, which means that the code can be added to scheduled Spark ML jobs without any R dependencies.
Introduction to ML Pipelines
The official Apache Spark site contains a more complete overview of ML Pipelines. This article focuses on introducing the basic concepts and the steps needed to work with ML Pipelines via sparklyr.
There are two important stages in building an ML Pipeline. The first one is creating a Pipeline. A good way to think of it is as an “empty” pipeline: this step only defines the steps that the data will go through, without processing any data yet. It is roughly the equivalent of building a pipeline of functions in R.
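For illustration, such an R pipeline could be written as a magrittr functional sequence; the definition below is reconstructed to match the printed components that follow:

library(dplyr)

# A functional sequence: recode cyl as a character variable, then fit a linear model
r_pipeline <- . %>%
  mutate(cyl = paste0("c", cyl)) %>%
  lm(am ~ cyl + mpg, data = .)

r_pipeline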
Functional sequence with the following components:
1. mutate(., cyl = paste0("c", cyl))
2. lm(am ~ cyl + mpg, data = .)
Use 'functions' to extract the individual functions.
The r_pipeline object has all of the steps needed to transform the data and fit the model, but it has not yet transformed any data. The second step is to pass data through the pipeline, which in turn outputs a fitted model, called a PipelineModel. The PipelineModel can then be used to produce predictions.
r_model <- r_pipeline(mtcars)

r_model
Call:
lm(formula = am ~ cyl + mpg, data = .)
Coefficients:
(Intercept) cylc6 cylc8 mpg
-0.54388 0.03124 -0.03313 0.04767
Taking advantage of Pipelines and PipelineModels
The two-stage ML Pipeline approach produces two final data products:
A PipelineModel that can be added to the daily Spark jobs to produce new predictions for the incoming data, again with no R dependencies.
A Pipeline that can be easily re-fitted on a regular interval, say every month. All that is needed is to pass a new sample to obtain the new coefficients.
Pipeline
An additional goal of this article is that the reader can follow along, so the data, transformations and Spark connection in this example will be kept as easy to reproduce as possible.
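For reference, a setup along the following lines keeps the example reproducible; the local connection and the nycflights13 flights data are assumptions used for illustration:

library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster and copy the flights data into Spark
sc <- spark_connect(master = "local")
spark_flights <- copy_to(sc, nycflights13::flights, "flights")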
Pipelines make heavy use of Feature Transformers. If you are new to Spark and sparklyr, it would be good to review what these transformers do. These functions use the Spark API directly to transform data, and may be faster at performing the data manipulations than a dplyr (SQL) transformation.
This example will start with dplyr transformations, which are ultimately SQL transformations, loaded into the df variable.
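A sketch of what that could look like is shown below; the specific columns and recodings are assumptions for illustration:

# dplyr (SQL) transformations that will later become the pipeline's first stage
df <- spark_flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(
    month = paste0("m", month),
    day = paste0("d", day)
  ) %>%
  select(dep_delay, sched_dep_time, month, day, distance)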
In sparklyr, there is one feature transformer that is not available in Spark, ft_dplyr_transformer(). The goal of this function is to convert the dplyr code to a SQL Feature Transformer that can then be used in a Pipeline.
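For example, the transformation stored in df could be converted like this (dplyr_transformer is just an illustrative name):

# Convert the dplyr code in df into a Spark SQLTransformer stage
dplyr_transformer <- ft_dplyr_transformer(sc, df)

dplyr_transformer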
Use the ml_param() function to extract the “statement” attribute. That attribute contains the finalized SQL statement. Notice that the flights table name has been replaced with __THIS__. This allows the pipeline to accept different table names as its source, making the pipeline very modular.
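Following the sketch above, the statement can be pulled out like this:

# Extract the finalized SQL statement from the transformer
ml_param(dplyr_transformer, "statement")

Putting the pieces together, a full pipeline can then be assembled. The sketch below is only illustrative; the binarizer threshold, the model formula, and the choice of ml_logistic_regression() are assumptions:

# An "empty" Pipeline: a dplyr/SQL stage, feature transformers, and a model
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(
    input_col = "dep_delay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_r_formula(delayed ~ month + day + distance) %>%
  ml_logistic_regression()

flights_pipeline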
Notice that there are no coefficients defined yet. That is because no data has actually been processed. Even though df is built on top of spark_flights, recall that the final SQL transformer replaces that table name with __THIS__, so there is no data to process yet.
PipelineModel
A quick partition of the data is created for this exercise.
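For instance, something along these lines creates a small training partition and fits the pipeline into a PipelineModel; the partition proportions are arbitrary, and flights_pipeline refers to the sketch above:

# Split the data into small training/testing samples plus a remainder
partitioned_flights <- sdf_partition(
  spark_flights,
  training = 0.01,
  testing = 0.01,
  rest = 0.98
)

# Fitting the Pipeline to data produces a PipelineModel
fitted_pipeline <- ml_fit(
  flights_pipeline,
  partitioned_flights$training
)

fitted_pipeline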
The ml_save() command can be used to save the Pipeline and PipelineModel to disk. The resulting output is a folder with the selected name, which contains all of the necessary Scala scripts.
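Continuing with the names used in the sketches above, saving both objects could look like this:

# Save the (unfitted) Pipeline and the fitted PipelineModel to disk
ml_save(flights_pipeline, "flights_pipeline", overwrite = TRUE)
ml_save(fitted_pipeline, "flights_model", overwrite = TRUE)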
The ml_load() command can be used to re-load Pipelines and PipelineModels. The saved ML Pipeline files can only be loaded into an open Spark session.
reloaded_model <- ml_load(sc, "flights_model")
A simple query can be used as the table that will supply the new predictions. This, of course, does not have to be done in R; at this point the “flights_model” can be loaded into an independent Spark session outside of R.
new_df <- spark_flights %>%
  filter(
    month == 7,
    day == 5
  )

ml_transform(reloaded_model, new_df)
Use ml_fit() again to pass new data; in this case, sample_frac() is used instead of sdf_partition() to provide the new data. The idea is that the re-fitting would happen at a later date than when the model was initially fitted.
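A sketch of that re-fitting step, reusing the saved “flights_pipeline” from above (the path name and sample fraction are assumptions):

# Re-load the Pipeline and re-fit it on a fresh sample of the data
reloaded_pipeline <- ml_load(sc, "flights_pipeline")

new_model <- ml_fit(
  reloaded_pipeline,
  sample_frac(spark_flights, 0.01)
)

new_model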