`R/sdf_interface.R`

`sdf_weighted_sample.Rd`

Draw a random sample of rows (with or without replacement) from a Spark DataFrame. If sampling is done without replacement, the process is conceptually equivalent to an iterative one in which, at each step, the probability of adding a row to the sample set equals its weight divided by the sum of the weights of all rows not yet in the sample set.

sdf_weighted_sample(x, weight_col, k, replacement = TRUE, seed = NULL)
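The without-replacement process described above can be sketched in plain Python. This is an illustrative model of the sampling semantics only, not sparklyr's actual implementation, which runs distributed on Spark:

```python
import random

def weighted_sample_no_replacement(rows, weights, k, seed=None):
    """Iteratively draw up to k distinct rows; at each step a row's
    selection probability is its weight divided by the total weight of
    the rows not yet in the sample set."""
    rng = random.Random(seed)
    remaining = list(range(len(rows)))
    sample = []
    for _ in range(min(k, len(rows))):
        total = sum(weights[i] for i in remaining)
        r = rng.uniform(0, total)
        acc = 0.0
        for pos, i in enumerate(remaining):
            acc += weights[i]
            if r <= acc:
                sample.append(rows[i])
                del remaining[pos]  # sampled rows cannot be drawn again
                break
    return sample
```

Because each selected row is removed from the pool, the result contains at most `len(rows)` distinct values, and a zero-weight row is only ever drawn after all positive-weight rows are exhausted.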

Argument | Description
---|---
`x` | An object coercible to a Spark DataFrame.
`weight_col` | Name of the weight column.
`k` | Sample set size.
`replacement` | Whether to sample with replacement.
`seed` | An (optional) integer seed.
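When `replacement = TRUE` (the default), the k draws are independent and a row's chance on every draw is its weight divided by the total weight, so the same row can appear multiple times. A minimal Python sketch of this mode, again only a model of the semantics:

```python
import random

def weighted_sample_with_replacement(rows, weights, k, seed=None):
    # Each of the k draws is independent; the sampled row is put back,
    # so duplicates are possible and k may exceed len(rows).
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=k)
```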

The family of functions prefixed with `sdf_` generally access the Scala Spark DataFrame API directly, as opposed to the `dplyr` interface, which uses Spark SQL. These functions will 'force' any pending SQL in a `dplyr` pipeline, such that the resulting `tbl_spark` object returned will no longer have the attached 'lazy' SQL operations. Note that the underlying Spark DataFrame *does* execute its operations lazily, so even though the pending set of operations is not (currently) exposed at the R level, these operations will only be executed when you explicitly `collect()` the table.

Other Spark data frames:
`sdf_copy_to()`, `sdf_distinct()`, `sdf_random_split()`, `sdf_register()`, `sdf_sample()`, `sdf_sort()`