Compute (Approximate) Quantiles with a Spark DataFrame

R/sdf_interface.R

sdf_quantile

Description

Given a numeric column within a Spark DataFrame, compute approximate quantiles.

Usage

sdf_quantile( 
  x, 
  column, 
  probabilities = c(0, 0.25, 0.5, 0.75, 1), 
  relative.error = 1e-05, 
  weight.column = NULL 
) 
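
A call following the signature above might look like the sketch below. It assumes a local Spark installation is available; the data set (`mtcars`) and the table name are arbitrary choices made for illustration.

```r
library(sparklyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data set to Spark; the table name is illustrative only.
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_quantile_demo", overwrite = TRUE)

# Default probabilities: minimum, quartiles, and maximum of the mpg column.
sdf_quantile(mtcars_tbl, "mpg")

# Custom probabilities with a looser error bound.
sdf_quantile(
  mtcars_tbl,
  "mpg",
  probabilities = c(0.1, 0.9),
  relative.error = 0.01
)

spark_disconnect(sc)
```

A larger relative.error generally trades accuracy for lower memory use and faster computation on large tables.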

Arguments

Arguments Description
x A spark_connection, ml_pipeline, or a tbl_spark.
column The column(s) for which quantiles should be computed. Multiple columns are only supported in Spark 2.0+.
probabilities A numeric vector of probabilities, each between 0 and 1, for which quantiles should be computed.
relative.error The maximum allowed difference between the actual percentile of a returned value and the requested percentile (e.g., if relative.error is 0.01 and probabilities is 0.95, then any value between the 94th and 96th percentiles is considered an acceptable approximation).
weight.column If not NULL, then a generalized version of the Greenwald-Khanna algorithm will be run to compute weighted percentiles, with each sample from column having a relative weight specified by the corresponding value in weight.column. The weights can be considered as relative frequencies of sample data points.
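
To make the relative.error tolerance concrete, the base R sketch below lists the values that would count as acceptable answers for a given probability. It needs no Spark connection. The helper `acceptable_values` is hypothetical and not part of sparklyr; it only mirrors the rank bound described for relative.error and is not how Spark computes the result.

```r
# Hypothetical helper (not part of sparklyr): return the values of x whose
# exact rank lies within the tolerance band around probability p.
acceptable_values <- function(x, p, relative.error) {
  n <- length(x)
  sorted <- sort(x)
  # Rank bounds, clamped to the valid index range 1..n.
  lo <- max(1, floor((p - relative.error) * n))
  hi <- min(n, ceiling((p + relative.error) * n))
  sorted[lo:hi]
}

x <- 1:1000
band <- acceptable_values(x, p = 0.95, relative.error = 0.01)

# Spans roughly the 94th to 96th percentile of x.
range(band)
```

Setting relative.error to 0 narrows the band to the exact quantile, which is typically more expensive to compute on large data.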