Feature Transformation -- QuantileDiscretizer (Estimator)

ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. The number of buckets actually used may be smaller than this value, for example when the input has too few distinct values to create enough distinct quantiles.

ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
  num_buckets = 2L, input_cols = NULL, output_cols = NULL,
  num_buckets_array = NULL, handle_invalid = "error",
  relative_error = 0.001, dataset = NULL,
  uid = random_string("quantile_discretizer_"), ...)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_col

The name of the input column.

output_col

The name of the output column.

num_buckets

Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.

input_cols

Names of input columns.

output_cols

Names of output columns.

num_buckets_array

Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.

handle_invalid

(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"

relative_error

(Spark 2.0.0+) Relative error (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a description). Must be in the range [0, 1]. Default: 0.001

dataset

(Optional) A tbl_spark. If provided, eagerly fit the (estimator) feature "transformer" against dataset. See details.

uid

A character string used to uniquely identify the feature transformer.

...

Optional arguments; currently unused.

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, a transformer is constructed and then immediately applied to the input tbl_spark, returning a tbl_spark.

Details

NaN handling: null and NaN values in the input column are ignored during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
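The three handle_invalid options can be mimicked in base R without Spark. The sketch below is only an illustration of the behaviour described above, not Spark's implementation; the function name bucketize_sketch and the example splits are hypothetical.

```r
# Illustrative emulation of handle_invalid; not Spark's Bucketizer code.
bucketize_sketch <- function(values, splits,
                             handle_invalid = c("error", "skip", "keep")) {
  handle_invalid <- match.arg(handle_invalid)
  invalid <- is.na(values)  # is.na() is TRUE for both NA and NaN
  if (any(invalid) && handle_invalid == "error") {
    stop("invalid values found; set handle_invalid to 'skip' or 'keep'")
  }
  num_buckets <- length(splits) - 1L
  buckets <- rep(NA_integer_, length(values))
  # findInterval() is 1-based, so subtract 1 to get 0-based bucket indices.
  buckets[!invalid] <-
    findInterval(values[!invalid], splits, rightmost.closed = TRUE) - 1L
  if (handle_invalid == "keep") {
    # Invalid values go to one extra bucket after the regular ones.
    buckets[invalid] <- num_buckets
    return(buckets)
  }
  # "skip" (and "error" when nothing is invalid): drop invalid rows.
  buckets[!invalid]
}

# Four regular buckets (0-3); with "keep", NaN lands in bucket 4.
example_splits <- c(-Inf, 2, 4, 6, Inf)
example_values <- c(1, NaN, 5, 7, 3)
kept <- bucketize_sketch(example_values, example_splits, "keep")
skipped <- bucketize_sketch(example_values, example_splits, "skip")
```

With "keep" the output has the same length as the input, while "skip" returns one fewer element because the NaN row is dropped.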

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relative_error parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Note that the result may differ from run to run, since the sampling strategy behind it is non-deterministic.
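To make the shape of the resulting splits concrete, here is a minimal base R sketch. It substitutes exact quantiles (quantile() with type = 1) for Spark's approximate approxQuantile, so it is deterministic and only illustrates the idea; the helper names quantile_splits_sketch and assign_bucket_sketch are hypothetical.

```r
# Illustrative only: exact quantiles stand in for Spark's approximate ones.
quantile_splits_sketch <- function(values, num_buckets) {
  values <- values[!is.na(values)]  # null/NaN values are ignored when fitting
  probs <- seq(0, 1, length.out = num_buckets + 1)
  cuts <- quantile(values, probs = probs, names = FALSE, type = 1)
  # Replace the outermost cut points with infinite bounds so that every
  # real value falls into some bucket.
  inner <- cuts[-c(1, length(cuts))]
  # Duplicate cut points collapse, which can leave fewer buckets than requested.
  unique(c(-Inf, inner, Inf))
}

assign_bucket_sketch <- function(values, splits) {
  # findInterval() is 1-based, so subtract 1 to get 0-based bucket indices.
  findInterval(values, splits, rightmost.closed = TRUE) - 1L
}

splits <- quantile_splits_sketch(c(1, 2, 3, 4, 5, 6, 7, 8), num_buckets = 4)
buckets <- assign_bucket_sketch(c(0, 2.5, 5, 100), splits)

# Too few distinct values: 4 buckets requested, fewer distinct splits produced.
few_splits <- quantile_splits_sketch(c(1, 1, 1, 1, 2), num_buckets = 4)
```

In this sketch, few_splits has fewer intervals than the 4 requested, which mirrors the case where the input has too few distinct values to create enough distinct quantiles.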

When dataset is provided for an estimator transformer, the function internally calls ml_fit() against dataset. Hence, the methods for spark_connection and ml_pipeline will then return a ml_transformer and a ml_pipeline with a ml_transformer appended, respectively. When x is a tbl_spark, the estimator will be fit against dataset before transforming x.

When dataset is not specified, the constructor returns a ml_estimator. In the case where x is a tbl_spark, the estimator is first fit against x to obtain a transformer, which is then immediately used to transform x.
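The dispatch rules above can be sketched as follows. This assumes the signature shown on this page (including the dataset argument, which may not be present in other sparklyr versions) and an existing Spark connection; the function name discretizer_demo, the table name "hours_demo", and the sample data are hypothetical. The code is wrapped in a function so that nothing touches Spark until it is called.

```r
# A sketch under the assumptions stated above; requires sparklyr and Spark.
discretizer_demo <- function(sc) {
  hours_tbl <- sparklyr::sdf_copy_to(
    sc, data.frame(hour = c(18, 19, 8, 5, 2.2)),
    name = "hours_demo", overwrite = TRUE
  )

  # x is a spark_connection, no dataset: an unfitted ml_estimator is returned.
  estimator <- sparklyr::ft_quantile_discretizer(
    sc, input_col = "hour", output_col = "bucket", num_buckets = 3L
  )

  # x is a spark_connection, dataset given: the estimator is fit eagerly
  # and a ml_transformer is returned.
  transformer <- sparklyr::ft_quantile_discretizer(
    sc, input_col = "hour", output_col = "bucket", num_buckets = 3L,
    dataset = hours_tbl
  )

  # x is a tbl_spark: fit against x, then transform x, in one call.
  transformed <- sparklyr::ft_quantile_discretizer(
    hours_tbl, input_col = "hour", output_col = "bucket", num_buckets = 3L
  )

  # The explicit two-step route through ml_fit() and ml_transform().
  fitted <- sparklyr::ml_fit(estimator, hours_tbl)
  explicit <- sparklyr::ml_transform(fitted, hours_tbl)

  list(estimator = estimator, transformer = transformer,
       transformed = transformed, explicit = explicit)
}
```

A typical call would be discretizer_demo(sparklyr::spark_connect(master = "local")), followed by sparklyr::spark_disconnect() on the same connection once the results have been inspected.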

See also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark.

ft_bucketizer

Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec