ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options(), ...)
k in fitting (the EM optimizer currently supports only symmetric distributions, so all values in the vector should be the same). For the Expectation-Maximization optimizer, values should be > 1.0. By default alpha = (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
beta = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
See ml_options for more details.
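
The default priors described for alpha and beta can be checked by hand. The sketch below only reproduces the arithmetic from the usage line. The helper name lda_default_priors is hypothetical and is not part of sparklyr. Replicating alpha with rep() is an assumption about how a symmetric prior of length k would look, not a description of the internal implementation.

```r
# Hypothetical helper (not a sparklyr function): reproduces the documented
# default hyperparameters for a given number of topics k.
lda_default_priors <- function(k) {
  stopifnot(is.numeric(k), length(k) == 1, k >= 1)
  list(
    alpha = (50 / k) + 1,  # 50/k heuristic plus the +1 adjustment for EM
    beta  = 0.1 + 1        # small smoothing term plus the +1 adjustment for EM
  )
}

k <- 10
priors <- lda_default_priors(k)

priors$alpha  # (50 / 10) + 1
priors$beta   # 0.1 + 1

# Assumption: a symmetric document-topic prior is the same value repeated k times.
alpha_vec <- rep(priors$alpha, k)
length(unique(alpha_vec)) == 1  # all entries identical, as EM requires
```

Both defaults are above 1.0 for any k >= 1, which matches the stated requirement for the Expectation-Maximization optimizer.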
Fit a Latent Dirichlet Allocation (LDA) model to a Spark DataFrame.
The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
For the terminology used in the LDA model, see the Spark LDA documentation.
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Asuncion et al. (2009)