Feature Transformation -- QuantileDiscretizer

Usage

ft_quantile_discretizer(x, input_col = NULL, output_col = NULL, n_buckets = 5)

Arguments

x
An object (usually a tbl_spark) coercible to a Spark DataFrame.
input_col
The name of the input column.
output_col
The name of the output column.
n_buckets
The number of buckets (quantiles) into which the values are grouped. Defaults to 5.

Description

Takes a column of continuous features and outputs a column of binned categorical features. The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts. The lowest bin is bounded below by -Infinity and the highest bin is bounded above by +Infinity, so the bins cover all real values. The transformer attempts to find n_buckets partitions based on a sample of the input data, but it may find fewer, depending on the sampled values.
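The binning idea described above can be sketched in base R. This is only an illustration under stated assumptions, not sparklyr's or Spark's implementation: the helper name quantile_bins is hypothetical, it computes exact quantiles on the whole vector rather than approximate quantiles on a sample, and it numbers buckets from 0.

```r
# Hypothetical helper (not part of sparklyr): quantile-based binning in base R.
quantile_bins <- function(values, n_buckets = 5) {
  # Probabilities of the interior cut points, e.g. 0.25, 0.5, 0.75 for 4 buckets.
  probs <- seq(0, 1, length.out = n_buckets + 1)
  interior <- quantile(values, probs = probs[-c(1, n_buckets + 1)], names = FALSE)

  # Duplicate cut points are dropped, so skewed data can yield fewer buckets
  # than requested. The outer bounds are infinite, covering all real values.
  splits <- c(-Inf, unique(interior), Inf)

  # Assign each value to a left-closed interval [lower, upper);
  # subtract 1 so that bucket indices start at 0.
  bucket <- cut(values, breaks = splits, labels = FALSE, right = FALSE) - 1

  list(splits = splits, bucket = bucket)
}

even <- quantile_bins(1:100, n_buckets = 4)
print(even$splits)
print(table(even$bucket))

# Heavily repeated values collapse the cut points, giving fewer buckets.
skewed <- quantile_bins(c(rep(1, 10), 2), n_buckets = 4)
print(skewed$splits)
```

In the first call the 100 values are spread evenly across four buckets. In the second call all three interior quantiles are equal, so only one distinct cut point survives and fewer buckets than requested are produced.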

Details

Note that the result may differ from run to run, because the underlying sampling strategy is non-deterministic.

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark.

Other feature transformation routines: ft_binarizer, ft_bucketizer, ft_discrete_cosine_transform, ft_elementwise_product, ft_index_to_string, ft_one_hot_encoder, ft_sql_transformer, ft_string_indexer, ft_vector_assembler, sdf_mutate