Perform Weighted Random Sampling on a Spark DataFrame


Draw a random sample of rows (with or without replacement) from a Spark DataFrame. If the sampling is done without replacement, it is conceptually equivalent to an iterative process in which, at each step, the probability of adding a row to the sample set equals its weight divided by the sum of the weights of all rows not yet in the sample set.
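The iterative process described above can be sketched in plain R on an ordinary numeric vector of weights. This is only an illustration of the concept, not how sparklyr or Spark implements the sampling; the helper name `weighted_sample_without_replacement` is hypothetical, and strictly positive weights are assumed.

```r
# Illustrative only: a plain-R version of the conceptual iterative process.
# `weighted_sample_without_replacement` is a hypothetical helper, not part of sparklyr.
weighted_sample_without_replacement <- function(weights, k, seed = NULL) {
  stopifnot(is.numeric(weights), k <= length(weights), all(weights > 0))
  if (!is.null(seed)) set.seed(seed)

  remaining <- seq_along(weights)  # indices of rows not yet in the sample set
  chosen <- integer(0)

  for (step in seq_len(k)) {
    # Probability of each remaining row: its weight / sum of remaining weights
    probs <- weights[remaining] / sum(weights[remaining])

    # Inverse-CDF draw of one position among the remaining rows
    pos <- findInterval(runif(1), cumsum(probs)) + 1
    pos <- min(pos, length(remaining))  # guard against floating-point rounding

    chosen <- c(chosen, remaining[pos])
    remaining <- remaining[-pos]        # the row cannot be drawn again
  }
  chosen
}

# Row indices of a size-3 sample from four rows with weights 1, 2, 3, 4
weighted_sample_without_replacement(c(1, 2, 3, 4), k = 3, seed = 1)
```

Each row can appear at most once in the result, and rows with larger weights tend to be selected earlier.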


sdf_weighted_sample(x, weight_col, k, replacement = TRUE, seed = NULL)


Argument Description
x An object coercible to a Spark DataFrame.
weight_col Name of the weight column
k Sample set size
replacement Whether to sample with replacement
seed An (optional) integer seed
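
A possible usage sketch follows, assuming the sparklyr and dplyr packages are installed and that `sc` is an open Spark connection (for example one created with `spark_connect(master = "local")`). The wrapper name `demo_weighted_sample`, the table name `weighted_demo`, and the example data are made up for illustration.

```r
# Hypothetical wrapper for illustration; `sc` is assumed to be an open
# sparklyr connection, e.g. sc <- sparklyr::spark_connect(master = "local")
demo_weighted_sample <- function(sc) {
  # Made-up example data: five rows with a numeric weight column
  local_df <- data.frame(
    id = 1:5,
    weight = c(1, 1, 2, 3, 10)
  )

  # Copy the local data frame to Spark
  sdf <- sparklyr::sdf_copy_to(sc, local_df, name = "weighted_demo", overwrite = TRUE)

  # Draw 3 rows without replacement, weighted by the `weight` column
  sampled <- sparklyr::sdf_weighted_sample(
    sdf,
    weight_col = "weight",
    k = 3,
    replacement = FALSE,
    seed = 42
  )

  # Bring the sampled rows back into R as a local data frame
  dplyr::collect(sampled)
}
```

Calling `demo_weighted_sample(sc)` should return a local data frame with three of the five rows; setting `replacement = TRUE` would allow the same row to appear more than once.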

See Also

Other Spark data frames: sdf_copy_to(), sdf_distinct(), sdf_random_split(), sdf_register(), sdf_sample(), sdf_sort()