Perform linear regression on a Spark DataFrame.

ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0,
  lambda = 0, iter.max = 100L, ml.options = ml_options(), ...)

Arguments

x
An object coercible to a Spark DataFrame (typically, a tbl_spark).
response
The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When response is a formula, it is used in preference to other parameters to set the response, features, and intercept parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. response ~ feature1 + feature2 + .... The intercept term can be omitted by including - 1 in the formula.
features
The names of the features (terms) to use for the model fit.
intercept
Boolean; should the model be fit with an intercept term?
alpha, lambda
Parameters controlling loss function penalization (e.g. for lasso, elastic net, and ridge regression). See Details for more information.
iter.max
The maximum number of iterations to use.
ml.options
Optional arguments, used to affect the model generated. See ml_options for more details.
...
Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with do.
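The calling forms described above can be sketched as follows. This is an illustrative example, not part of the original reference: it assumes a local Spark installation reachable through spark_connect(master = "local"), and it uses the built-in mtcars dataset with the columns mpg, wt, and cyl purely as stand-ins for a real response and features. The lambda value of 0.1 is an arbitrary choice.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; adjust `master` for a real cluster.
sc <- spark_connect(master = "local")

# Copy a small built-in R dataset to Spark for illustration.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Form 1: response and features given by name.
fit_names <- ml_linear_regression(
  mtcars_tbl,
  response = "mpg",
  features = c("wt", "cyl")
)

# Form 2: response given as a formula; it sets response and features.
fit_formula <- ml_linear_regression(mtcars_tbl, response = mpg ~ wt + cyl)

# Form 3: formula first, with the data supplied through `data`.
fit_data <- ml_linear_regression(mpg ~ wt + cyl, data = mtcars_tbl)

# Penalized fit: alpha = 1 selects the lasso penalty (see Details);
# lambda = 0.1 is an arbitrary illustrative penalty strength.
fit_lasso <- ml_linear_regression(
  mtcars_tbl,
  response = "mpg",
  features = c("wt", "cyl"),
  alpha = 1,
  lambda = 0.1
)

spark_disconnect(sc)
```

The third form is the one that lets the data be supplied by name rather than as the first argument.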

Details

Spark implements both \(L1\) and \(L2\) regularization in linear regression models. See the preamble of the Spark ML classification and regression documentation (http://spark.apache.org/docs/latest/ml-classification-regression.html) for more details on how the loss function is parameterized.

In particular, with alpha set to 1, the parameterization is equivalent to a lasso model (https://en.wikipedia.org/wiki/Lasso_(statistics)); if alpha is set to 0, the parameterization is equivalent to a ridge regression model (https://en.wikipedia.org/wiki/Tikhonov_regularization).
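As a sketch of the parameterization, assuming the elastic net objective as stated in the Spark documentation linked above, the fitted coefficients minimize the expression below. The symbols are introduced here for illustration and do not appear in the original page: \(n\) is the number of observations, \(w\) the coefficient vector, and \(x_i\), \(y_i\) the feature vector and response of observation \(i\); \(\alpha\) and \(\lambda\) correspond to the alpha and lambda arguments.

\[
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( x_i^{\top} w - y_i \right)^2
+ \lambda \left[ \frac{1 - \alpha}{2} \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1 \right]
\]

With \(\alpha = 1\) only the \(L1\) term remains, which gives the lasso; with \(\alpha = 0\) only the \(L2\) term remains, which gives ridge regression. Values strictly between 0 and 1 mix the two penalties (elastic net), and with the default \(\lambda = 0\) the penalty vanishes altogether.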

See also

Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_kmeans, ml_lda, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression