```
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
```

# Model Data

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within `sparklyr`. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

## Exercise

Here’s an example where we use `ml_linear_regression()` to fit a linear regression model. We’ll use the built-in `mtcars` dataset, and see if we can predict a car’s fuel consumption (`mpg`) based on its weight (`wt`) and the number of cylinders the engine contains (`cyl`). We’ll assume in each case that the relationship between `mpg` and each of our features is linear.

### Initialize the environment

We will start by creating a local Spark session and loading the `mtcars` data frame into it.

### Prepare the data

Spark provides data frame operations that make it easier to prepare data for modeling. In this case, we will use the `sdf_random_split()` function to divide the `mtcars` data into “training” and “test” sets.

```
partitions <- mtcars_tbl %>%
  select(mpg, wt, cyl) %>%
  sdf_random_split(training = 0.5, test = 0.5, seed = 1099)
```

Note that the newly created `partitions` variable does not contain data; it contains a pointer to where the data was split within Spark. That means that no data is downloaded to the R session.
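To convince yourself that the split happened inside Spark, you can count the rows of each partition with `sdf_nrow()`; the counting runs in the cluster and only the two totals return to R (a quick check, assuming the `partitions` object created above):

```
# Row counts are computed in Spark; only the totals come back to R
sdf_nrow(partitions$training)
sdf_nrow(partitions$test)
```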

### Fit the model

Next, we will fit a linear model to the training data set:

```
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ .)

fit
```

```
Formula: mpg ~ .

Coefficients:
(Intercept)          wt         cyl
  38.927395   -4.131014   -0.938832
```

For linear regression models produced by Spark, we can use `summary()` to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.

```
summary(fit)
```

```
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -3.4891  -1.5262  -0.1481   0.8508   6.3162

Coefficients:
(Intercept)          wt         cyl
  38.927395   -4.131014   -0.938832

R-Squared: 0.8469
Root Mean Squared Error: 2.416
```
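As a rough sanity check (not part of the Spark workflow), you can fit the same formula locally with base R’s `lm()` on the full `mtcars` data; the coefficients will differ somewhat because the Spark model was trained on only half of the rows:

```
# Local fit on all 32 rows of mtcars, for comparison with the Spark model
local_fit <- lm(mpg ~ wt + cyl, data = mtcars)
coef(local_fit)
```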

### Use the model

We can use `ml_predict()` to create a Spark data frame that contains the predictions against the testing data set.

```
pred <- ml_predict(fit, partitions$test)

head(pred)
```

```
# Source: spark<?> [?? x 4]
    mpg    wt   cyl prediction
  <dbl> <dbl> <dbl>      <dbl>
1  14.3  3.57     8      16.7
2  14.7  5.34     8       9.34
3  15    3.57     8      16.7
4  15.2  3.44     8      17.2
5  15.2  3.78     8      15.8
6  15.5  3.52     8      16.9
```
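To summarize accuracy on the test partition with a single number, Spark’s regression evaluator can compute the RMSE of the predictions directly in the cluster; a sketch using `ml_regression_evaluator()` (the column names match the `pred` data frame above):

```
# RMSE of the predictions against the observed mpg values,
# computed inside Spark
ml_regression_evaluator(
  pred,
  label_col = "mpg",
  prediction_col = "prediction",
  metric_name = "rmse"
)
```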

### Further reading

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with `dplyr` pipelines. To learn more, see the Machine Learning article on this site. For a list of Spark ML models available through `sparklyr`, visit Reference - ML.
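When you are finished, it is good practice to close the connection and free the local Spark session’s resources:

```
spark_disconnect(sc)
```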