Feature Transformation -- QuantileDiscretizer


ft_quantile_discretizer(x, input_col = NULL, output_col = NULL, n_buckets = 5)


x: An object (usually a spark_tbl) coercible to a Spark DataFrame.
input_col: The name of the input column(s).
output_col: The name of the output column.
n_buckets: The number of buckets to use.


Takes a column of continuous features and outputs a column of binned categorical features. The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts. The lowest and highest bin bounds are -Infinity and +Infinity, so the bins cover all real values. The transformer attempts to find n_buckets partitions (numBuckets in Spark) based on the sample, but it may find fewer depending on the sampled values, for example when many of them are identical.
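The binning idea described above can be sketched in base R without a Spark connection. This is only an illustration under stated assumptions: `quantile_splits()` and `discretize()` are hypothetical helpers, not part of sparklyr, and exact quantiles from `quantile()` stand in for whatever sample-based estimate Spark computes. The sketch shows the open-ended outer bounds and how duplicate split points can leave fewer buckets than requested.

```r
# Hypothetical helper (not part of sparklyr): quantile-based split points
# with -Inf / +Inf as the outer bounds.
quantile_splits <- function(values, n_buckets) {
  probs <- seq(0, 1, length.out = n_buckets + 1)
  cuts <- quantile(values, probs = probs, names = FALSE)
  # Drop the minimum and maximum, then remove duplicated split points.
  inner <- unique(cuts[-c(1, length(cuts))])
  c(-Inf, inner, Inf)
}

# Hypothetical helper: map each value to a bucket index.
# Assumption for this sketch: indices start at 0 and bins are [lower, upper).
discretize <- function(values, splits) {
  findInterval(values, splits) - 1L
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
splits <- quantile_splits(x, n_buckets = 5)
buckets <- discretize(x, splits)

# Heavily duplicated data: several quantiles coincide, so fewer than
# the requested 5 buckets remain.
skewed <- c(rep(0, 8), 1, 2)
skewed_splits <- quantile_splits(skewed, n_buckets = 5)
n_skewed_buckets <- length(skewed_splits) - 1
```

With evenly spread input, each of the five buckets receives two values. With the skewed input, most split points collapse onto 0, so `n_skewed_buckets` is smaller than the requested `n_buckets`.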


Note that the result may differ from run to run, because the underlying sampling strategy is non-deterministic.
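The effect of sampling on the split points can be illustrated in base R. This is a rough sketch, not Spark's actual sampling procedure: `sampled_splits()` is a hypothetical helper, and the sample size of 200 is an arbitrary choice for the illustration. Two calls on the same data draw different samples and therefore produce different split points.

```r
# Hypothetical helper (not part of sparklyr): estimate split points
# from a random sample of the data rather than from all of it.
sampled_splits <- function(values, n_buckets, sample_size) {
  s <- sample(values, size = sample_size)
  probs <- seq(0, 1, length.out = n_buckets + 1)
  # Keep only the interior quantiles; the outer bounds are infinite.
  inner <- quantile(s, probs = probs[-c(1, length(probs))], names = FALSE)
  c(-Inf, unique(inner), Inf)
}

set.seed(1)  # fixed only so that this illustration is reproducible
population <- rnorm(10000)

# Same data, same settings, two different random samples.
run_a <- sampled_splits(population, n_buckets = 5, sample_size = 200)
run_b <- sampled_splits(population, n_buckets = 5, sample_size = 200)
```

Comparing `run_a` and `run_b` shows interior split points that are close but not equal, which is why values near a bin boundary can land in different buckets on different runs.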

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark. Other feature transformation routines: ft_binarizer, ft_bucketizer, ft_discrete_cosine_transform, ft_elementwise_product, ft_index_to_string, ft_one_hot_encoder, ft_sql_transformer, ft_string_indexer, ft_vector_assembler, sdf_mutate