Perform k-means clustering on a Spark DataFrame.

```r
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
  compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options(), ...)
```

## Arguments

| Argument | Description |
|---|---|
| `x` | An object coercible to a Spark DataFrame (typically, a `tbl_spark`). |
| `centers` | The number of cluster centers to compute. |
| `iter.max` | The maximum number of iterations to use. |
| `features` | The names of the features (terms) to use for the model fit. |
| `compute.cost` | Whether to compute the cost for the k-means model using Spark's `computeCost`. |
| `tolerance` | The convergence tolerance for iterative algorithms. |
| `ml.options` | Optional arguments used to affect the model generated. See `ml_options` for more details. |
| `...` | Optional arguments. The `data` argument can be used to specify the data to be used when `x` is a formula; this allows calls of the form `ml_linear_regression(y ~ x, data = tbl)`, and is especially useful in conjunction with `do`. |

## Value

An `ml_model` object of class `kmeans`, with overloaded `print`, `fitted`, and `predict` functions.
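A minimal usage sketch, assuming a local Spark installation and the built-in `iris` data set. Note that column names such as `Petal_Length` reflect sparklyr's convention of replacing dots with underscores when data are copied to Spark; the connection details are illustrative, not prescriptive:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed locally)
sc <- spark_connect(master = "local")

# Copy iris to Spark; dots in column names become underscores
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit k-means with 3 cluster centers on two petal features
model <- ml_kmeans(iris_tbl, centers = 3,
                   features = c("Petal_Length", "Petal_Width"))

# Inspect the fitted centers and assign each row to a cluster
print(model)
predicted <- predict(model, iris_tbl)

spark_disconnect(sc)
```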

## References

Bahmani et al., Scalable K-Means++, VLDB 2012

## See also

For information on how Spark k-means clustering is implemented, please see
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means.

Other Spark ML routines: `ml_als_factorization`, `ml_decision_tree`, `ml_generalized_linear_regression`, `ml_gradient_boosted_trees`, `ml_lda`, `ml_linear_regression`, `ml_logistic_regression`, `ml_multilayer_perceptron`, `ml_naive_bayes`, `ml_one_vs_rest`, `ml_pca`, `ml_random_forest`, `ml_survival_regression`