Chi-square hypothesis testing for categorical data.

R/ml_stat.R

ml_chisquare_test

Description

Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

Usage

 
ml_chisquare_test(x, features, label) 

Arguments

Arguments Description
x A tbl_spark.
features The name(s) of the feature columns. This can also be the name of a single vector column created using ft_vector_assembler().
label The name of the label column.

Value

A data frame with one row for each (feature, label) pair with p-values, degrees of freedom, and test statistics.

Examples

library(sparklyr)
 
sc <- spark_connect(master = "local") 
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) 
 
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width") 
 
ml_chisquare_test(iris_tbl, features = features, label = "Species") 
#>        feature   label      p_value degrees_of_freedom statistic
#> 1  Petal_Width Species 0.000000e+00                 42 271.75000
#> 2 Petal_Length Species 0.000000e+00                 84 271.80000
#> 3 Sepal_Length Species 6.665987e-09                 68 156.26667
#> 4  Sepal_Width Species 6.016031e-05                 44  89.54629