Spark Operations

Function(s) Description
get_spark_sql_catalog_implementation() Retrieve the Spark connection’s SQL catalog implementation property
spark_config() Read Spark Configuration
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit() Manage Spark Connections
spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions() Download and install various versions of Spark
spark_log() View Entries in the Spark Log
spark_web() Open the Spark web interface
connection_is_open() Check whether the connection is open
connection_spark_shinyapp() A Shiny app that can be used to construct a spark_connect statement
spark_session_config() Runtime configuration interface for the Spark Session
spark_set_checkpoint_dir() spark_get_checkpoint_dir() Set/Get Spark checkpoint directory
spark_table_name() Generate a Table Name from Expression
spark_version_from_home() Get the Spark Version Associated with a Spark Installation
spark_versions() Retrieves a data frame of available Spark versions that can be installed.
spark_config_kubernetes() Kubernetes Configuration
spark_config_settings() Retrieve Available Settings
spark_connection_find() Find Spark Connection
spark_dependency_fallback() Fallback to Spark Dependency
spark_extension() Create Spark Extension
spark_load_table() Reads from a Spark Table into a Spark DataFrame.
list_sparklyr_jars() List all sparklyr-*.jar files that have been built
spark_config_packages() Creates Spark Configuration
spark_connection() Retrieve the Spark Connection Associated with an R Object
spark_adaptive_query_execution() Retrieves or sets status of Spark AQE
spark_advisory_shuffle_partition_size() Retrieves or sets advisory size of the shuffle partition
spark_auto_broadcast_join_threshold() Retrieves or sets the auto broadcast join threshold
spark_coalesce_initial_num_partitions() Retrieves or sets initial number of shuffle partitions before coalescing
spark_coalesce_min_num_partitions() Retrieves or sets the minimum number of shuffle partitions after coalescing
spark_coalesce_shuffle_partitions() Retrieves or sets whether coalescing contiguous shuffle partitions is enabled
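
The connection functions listed above can be combined as in the following minimal sketch. It assumes a local Spark installation is already present (spark_install() can download one); the driver memory value and the number of log lines are illustrative choices only.

```r
library(sparklyr)

# Start from the default configuration and adjust one setting.
# Assumption: 2G of driver memory is just an illustrative value.
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "2G"

# Connect to a local Spark instance.
sc <- spark_connect(master = "local", config = conf)

# Confirm the connection is usable before doing any work.
is_open <- spark_connection_is_open(sc)
print(is_open)

# Show the last few entries of the Spark log for this connection.
print(spark_log(sc, n = 5))

# Close the connection when finished.
spark_disconnect(sc)
```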

Spark Data

Function(s) Description
spark_read() Read file(s) into a Spark DataFrame using a custom reader
spark_read_avro() Read Apache Avro data into a Spark DataFrame.
spark_read_binary() Read binary data into a Spark DataFrame.
spark_read_csv() Read a CSV file into a Spark DataFrame
spark_read_delta() Read from Delta Lake into a Spark DataFrame.
spark_read_image() Read image data into a Spark DataFrame.
spark_read_jdbc() Read from JDBC connection into a Spark DataFrame.
spark_read_json() Read a JSON file into a Spark DataFrame
spark_read_libsvm() Read libsvm file into a Spark DataFrame.
spark_read_parquet() Read a Parquet file into a Spark DataFrame
spark_read_source() Read from a generic source into a Spark DataFrame.
spark_read_table() Reads from a Spark Table into a Spark DataFrame.
spark_read_orc() Read an ORC file into a Spark DataFrame
spark_read_text() Read a Text file into a Spark DataFrame
spark_save_table() Saves a Spark DataFrame as a Spark table
spark_write() Write Spark DataFrame to file using a custom writer
spark_write_avro() Serialize a Spark DataFrame into Apache Avro format
spark_write_orc() Write a Spark DataFrame to an ORC file
spark_write_text() Write a Spark DataFrame to a Text file
spark_write_csv() Write a Spark DataFrame to a CSV
spark_write_delta() Writes a Spark DataFrame into Delta Lake
spark_write_jdbc() Writes a Spark DataFrame into a JDBC table
spark_write_json() Write a Spark DataFrame to a JSON file
spark_write_parquet() Write a Spark DataFrame to a Parquet file
spark_write_source() Writes a Spark DataFrame into a generic source
spark_write_table() Writes a Spark DataFrame into a Spark table
spark_write_rds() Write Spark DataFrame to RDS files
collect_from_rds() Collect Spark data serialized in RDS format into R
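
As a rough illustration of the read and write functions above, the sketch below writes a Spark DataFrame to Parquet and reads it back. It assumes a local Spark installation; the output path and the table names are hypothetical.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Copy a small built-in R data frame to Spark so there is something to write.
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Hypothetical output location inside the session's temporary directory.
out_path <- file.path(tempdir(), "mtcars_parquet")

# Write the Spark DataFrame as Parquet, replacing any previous output.
spark_write_parquet(mtcars_tbl, path = out_path, mode = "overwrite")

# Read the Parquet files back into a new Spark DataFrame.
mtcars_back <- spark_read_parquet(sc, name = "mtcars_back", path = out_path)

# Count the rows that made the round trip.
n_back <- sdf_nrow(mtcars_back)
print(n_back)

spark_disconnect(sc)
```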

Spark Tables

Function(s) Description
src_databases() Show database list
tbl_cache() Cache a Spark Table
tbl_change_db() Use specific database
tbl_uncache() Uncache a Spark Table
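
A short sketch of the table functions above, assuming a local Spark installation. The table name "mtcars_tbl" is a hypothetical name chosen for this example.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Register a small table so there is something to cache.
copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# List the databases visible to this connection.
dbs <- src_databases(sc)
print(dbs)

# Load the table into memory, then release it again.
tbl_cache(sc, "mtcars_tbl")
tbl_uncache(sc, "mtcars_tbl")

spark_disconnect(sc)
```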

Spark DataFrames

Function(s) Description
[(<tbl_spark>) Subsetting operator for Spark DataFrame
copy_to(<spark_connection>) Copy an R Data Frame to Spark
sdf_along() Create DataFrame for along Object
sdf_bind_rows() sdf_bind_cols() Bind multiple Spark DataFrames by row and column
sdf_broadcast() Broadcast hint
sdf_checkpoint() Checkpoint a Spark DataFrame
sdf_coalesce() Coalesces a Spark DataFrame
sdf_copy_to() sdf_import() Copy an Object into Spark
sdf_distinct() Invoke distinct on a Spark DataFrame
sdf_drop_duplicates() Remove duplicates from a Spark DataFrame
sdf_expand_grid() Create a Spark DataFrame containing all combinations of inputs
sdf_from_avro() Convert column(s) from avro format
sdf_len() Create DataFrame for Length
sdf_num_partitions() Gets number of partitions of a Spark DataFrame
sdf_random_split() sdf_partition() Partition a Spark DataFrame
sdf_partition_sizes() Compute the number of records within each partition of a Spark DataFrame
sdf_pivot() Pivot a Spark DataFrame
sdf_predict() sdf_transform() sdf_fit() sdf_fit_and_transform() Spark ML – Transform, fit, and predict methods (sdf_ interface)
sdf_rbeta() Generate random samples from a Beta distribution
sdf_rbinom() Generate random samples from a binomial distribution
sdf_rcauchy() Generate random samples from a Cauchy distribution
sdf_rchisq() Generate random samples from a chi-squared distribution
sdf_rexp() Generate random samples from an exponential distribution
sdf_rgamma() Generate random samples from a Gamma distribution
sdf_rgeom() Generate random samples from a geometric distribution
sdf_rhyper() Generate random samples from a hypergeometric distribution
sdf_rlnorm() Generate random samples from a log normal distribution
sdf_rnorm() Generate random samples from the standard normal distribution
sdf_rpois() Generate random samples from a Poisson distribution
sdf_rt() Generate random samples from a t-distribution
sdf_runif() Generate random samples from the uniform distribution U(0, 1).
sdf_rweibull() Generate random samples from a Weibull distribution.
sdf_read_column() Read a Column from a Spark DataFrame
sdf_register() Register a Spark DataFrame
sdf_repartition() Repartition a Spark DataFrame
sdf_residuals() Model Residuals
sdf_sample() Randomly Sample Rows from a Spark DataFrame
sdf_separate_column() Separate a Vector Column into Scalar Columns
sdf_seq() Create DataFrame for Range
sdf_sort() Sort a Spark DataFrame
sdf_to_avro() Convert column(s) to avro format
sdf_with_unique_id() Add a Unique ID Column to a Spark DataFrame
sdf_collect() Collect a Spark DataFrame into R.
sdf_crosstab() Cross Tabulation
sdf_debug_string() Debug Info for Spark DataFrame
sdf_describe() Compute summary statistics for columns of a data frame
sdf_dim() sdf_nrow() sdf_ncol() Support for Dimension Operations
sdf_is_streaming() Spark DataFrame is Streaming
sdf_last_index() Returns the last index of a Spark DataFrame
sdf_save_table() sdf_load_table() sdf_save_parquet() sdf_load_parquet() Save / Load a Spark DataFrame
sdf_persist() Persist a Spark DataFrame
sdf_project() Project features onto principal components
sdf_quantile() Compute (Approximate) Quantiles with a Spark DataFrame
sdf_schema() Read the Schema of a Spark DataFrame
sdf_sql() Spark DataFrame from SQL
sdf_unnest_longer() Unnest longer
sdf_unnest_wider() Unnest wider
sdf_with_sequential_id() Add a Sequential ID Column to a Spark DataFrame
inner_join(<tbl_spark>) left_join(<tbl_spark>) right_join(<tbl_spark>) full_join(<tbl_spark>) Join Spark tbls.
hof_aggregate() Apply Aggregate Function to Array Column
hof_array_sort() Sorts array using a custom comparator
hof_exists() Determine Whether Some Element Exists in an Array Column
hof_filter() Filter Array Column
hof_forall() Checks whether all elements in an array satisfy a predicate
hof_map_filter() Filters a map
hof_map_zip_with() Merges two maps into one
hof_transform() Transform Array Column
hof_transform_keys() Transforms keys of a map
hof_transform_values() Transforms values of a map
hof_zip_with() Combines 2 Array Columns
sdf_weighted_sample() Perform Weighted Random Sampling on a Spark DataFrame
transform_sdf() Transform a subset of column(s) in a Spark DataFrame
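
A few of the sdf_ functions above, sketched under the assumption of a local Spark installation. The split names and weights are illustrative.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Create a Spark DataFrame with a single `id` column of 100 rows.
ids <- sdf_len(sc, 100)

# Inspect its size and how it is partitioned.
n <- sdf_nrow(ids)
print(n)
print(sdf_num_partitions(ids))

# Split the rows at random into two named Spark DataFrames.
# A fixed seed makes the split reproducible.
splits <- sdf_random_split(ids, training = 0.8, test = 0.2, seed = 42)
split_names <- names(splits)
print(split_names)

spark_disconnect(sc)
```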

Spark Machine Learning

Function(s) Description
ml_decision_tree_classifier() ml_decision_tree() ml_decision_tree_regressor() Spark ML – Decision Trees
ml_generalized_linear_regression() Spark ML – Generalized Linear Regression
ml_gbt_classifier() ml_gradient_boosted_trees() ml_gbt_regressor() Spark ML – Gradient Boosted Trees
ml_kmeans() ml_compute_cost() ml_compute_silhouette_measure() Spark ML – K-Means Clustering
ml_lda() ml_describe_topics() ml_log_likelihood() ml_log_perplexity() ml_topics_matrix() Spark ML – Latent Dirichlet Allocation
ml_linear_regression() Spark ML – Linear Regression
ml_logistic_regression() Spark ML – Logistic Regression
ml_model_data() Extracts data associated with a Spark ML model
ml_multilayer_perceptron_classifier() ml_multilayer_perceptron() Spark ML – Multilayer Perceptron
ml_naive_bayes() Spark ML – Naive-Bayes
ml_one_vs_rest() Spark ML – OneVsRest
ft_pca() ml_pca() Feature Transformation – PCA (Estimator)
ml_prefixspan() ml_freq_seq_patterns() Frequent Pattern Mining – PrefixSpan
ml_random_forest_classifier() ml_random_forest() ml_random_forest_regressor() Spark ML – Random Forest
ml_aft_survival_regression() ml_survival_regression() Spark ML – Survival Regression
ml_add_stage() Add a Stage to a Pipeline
ml_als() ml_recommend() Spark ML – ALS
ml_approx_nearest_neighbors() ml_approx_similarity_join() Utility functions for LSH models
ml_fpgrowth() ml_association_rules() ml_freq_itemsets() Frequent Pattern Mining – FPGrowth
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_bisecting_kmeans() Spark ML – Bisecting K-Means Clustering
ml_call_constructor() Wrap a Spark ML JVM object
ml_chisquare_test() Chi-square hypothesis testing for categorical data.
ml_clustering_evaluator() Spark ML - Clustering Evaluator
ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering() Constructors for ml_model Objects
ml_corr() Compute correlation matrix
ml_sub_models() ml_validation_metrics() ml_cross_validator() ml_train_validation_split() Spark ML – Tuning
ml_default_stop_words() Default stop words
ml_evaluate() Evaluate the Model on a Validation Set
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models
ft_word2vec() ml_find_synonyms() Feature Transformation – Word2Vec (Estimator)
is_ml_transformer() is_ml_estimator() ml_fit() ml_transform() ml_fit_and_transform() ml_predict() Spark ML – Transform, fit, and predict methods (ml_ interface)
ml_gaussian_mixture() Spark ML – Gaussian Mixture clustering.
ml_is_set() ml_param_map() ml_param() ml_params() Spark ML – ML Params
ml_isotonic_regression() Spark ML – Isotonic Regression
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation – StringIndexer (Estimator)
ml_linear_svc() Spark ML – LinearSVC
ml_save() ml_load() Spark ML – Model Persistence
ml_pipeline() Spark ML – Pipelines
ml_power_iteration() Spark ML – Power Iteration Clustering
ml_stage() ml_stages() Spark ML – Pipeline stage extraction
ml_standardize_formula() Standardize Formula Input for ml_model
ml_summary() Spark ML – Extraction of summary metrics
ml_uid() Spark ML – UID
ft_count_vectorizer() ml_vocabulary() Feature Transformation – CountVectorizer (Estimator)
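
The fit and predict functions above might be used as in this minimal sketch. It assumes a local Spark installation; the formula and the mtcars data are chosen only for illustration.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Fit a linear regression of fuel efficiency on weight and cylinder count.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
print(fit)

# Score the same data; the result gains a `prediction` column.
preds <- ml_predict(fit, mtcars_tbl)

# Bring the scored rows into R to inspect them.
preds_local <- sdf_collect(preds)
print(head(preds_local[, c("mpg", "prediction")]))

spark_disconnect(sc)
```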

Spark Feature Transformers

Function(s) Description
ft_binarizer() Feature Transformation – Binarizer (Transformer)
ft_bucketizer() Feature Transformation – Bucketizer (Transformer)
ft_count_vectorizer() ml_vocabulary() Feature Transformation – CountVectorizer (Estimator)
ft_dct() ft_discrete_cosine_transform() Feature Transformation – Discrete Cosine Transform (DCT) (Transformer)
ft_elementwise_product() Feature Transformation – ElementwiseProduct (Transformer)
ft_index_to_string() Feature Transformation – IndexToString (Transformer)
ft_one_hot_encoder() Feature Transformation – OneHotEncoder (Transformer)
ft_quantile_discretizer() Feature Transformation – QuantileDiscretizer (Estimator)
ft_sql_transformer() ft_dplyr_transformer() Feature Transformation – SQLTransformer
ft_string_indexer() ml_labels() ft_string_indexer_model() Feature Transformation – StringIndexer (Estimator)
ft_vector_assembler() Feature Transformation – VectorAssembler (Transformer)
ft_tokenizer() Feature Transformation – Tokenizer (Transformer)
ft_regex_tokenizer() Feature Transformation – RegexTokenizer (Transformer)
ft_bucketed_random_projection_lsh() ft_minhash_lsh() Feature Transformation – LSH (Estimator)
ft_chisq_selector() Feature Transformation – ChiSqSelector (Estimator)
ft_feature_hasher() Feature Transformation – FeatureHasher (Transformer)
ft_hashing_tf() Feature Transformation – HashingTF (Transformer)
ft_idf() Feature Transformation – IDF (Estimator)
ft_imputer() Feature Transformation – Imputer (Estimator)
ft_interaction() Feature Transformation – Interaction (Transformer)
ft_max_abs_scaler() Feature Transformation – MaxAbsScaler (Estimator)
ft_min_max_scaler() Feature Transformation – MinMaxScaler (Estimator)
ft_ngram() Feature Transformation – NGram (Transformer)
ft_normalizer() Feature Transformation – Normalizer (Transformer)
ft_one_hot_encoder_estimator() Feature Transformation – OneHotEncoderEstimator (Estimator)
ft_pca() ml_pca() Feature Transformation – PCA (Estimator)
ft_polynomial_expansion() Feature Transformation – PolynomialExpansion (Transformer)
ft_r_formula() Feature Transformation – RFormula (Estimator)
ft_standard_scaler() Feature Transformation – StandardScaler (Estimator)
ft_stop_words_remover() Feature Transformation – StopWordsRemover (Transformer)
ft_vector_indexer() Feature Transformation – VectorIndexer (Estimator)
ft_vector_slicer() Feature Transformation – VectorSlicer (Transformer)
ft_word2vec() ml_find_synonyms() Feature Transformation – Word2Vec (Estimator)
ft_robust_scaler() Feature Transformation – RobustScaler (Estimator)
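
Feature transformers can be applied directly to a Spark DataFrame. The sketch below chains two of the transformers listed above; it assumes a local Spark installation, and the threshold and output column names are hypothetical.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Flag cars above a horsepower threshold (illustrative value) as 1, others as 0.
binarized <- ft_binarizer(
  mtcars_tbl,
  input_col = "hp",
  output_col = "high_hp",
  threshold = 150
)

# Combine two numeric columns into a single vector column.
assembled <- ft_vector_assembler(
  binarized,
  input_cols = c("wt", "cyl"),
  output_col = "features"
)

# Inspect the resulting schema, which now includes the two new columns.
schema <- sdf_schema(assembled)
print(names(schema))

spark_disconnect(sc)
```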

Spark Machine Learning Utilities

Function(s) Description
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() Spark ML - Evaluators
ml_feature_importances() ml_tree_feature_importance() Spark ML - Feature Importance for Tree Models
tidy(<ml_model_als>) augment(<ml_model_als>) glance(<ml_model_als>) Tidying methods for Spark ML ALS
tidy(<ml_model_generalized_linear_regression>) tidy(<ml_model_linear_regression>) augment(<ml_model_generalized_linear_regression>) augment(<ml_model_linear_regression>) glance(<ml_model_generalized_linear_regression>) glance(<ml_model_linear_regression>) Tidying methods for Spark ML linear models
tidy(<ml_model_isotonic_regression>) augment(<ml_model_isotonic_regression>) glance(<ml_model_isotonic_regression>) Tidying methods for Spark ML Isotonic Regression
tidy(<ml_model_lda>) augment(<ml_model_lda>) glance(<ml_model_lda>) Tidying methods for Spark ML LDA models
tidy(<ml_model_linear_svc>) augment(<ml_model_linear_svc>) glance(<ml_model_linear_svc>) Tidying methods for Spark ML linear svc
tidy(<ml_model_logistic_regression>) augment(<ml_model_logistic_regression>) glance(<ml_model_logistic_regression>) Tidying methods for Spark ML Logistic Regression
tidy(<ml_model_multilayer_perceptron_classification>) augment(<ml_model_multilayer_perceptron_classification>) glance(<ml_model_multilayer_perceptron_classification>) Tidying methods for Spark ML MLP
tidy(<ml_model_naive_bayes>) augment(<ml_model_naive_bayes>) glance(<ml_model_naive_bayes>) Tidying methods for Spark ML Naive Bayes
tidy(<ml_model_pca>) augment(<ml_model_pca>) glance(<ml_model_pca>) Tidying methods for Spark ML Principal Component Analysis
tidy(<ml_model_aft_survival_regression>) augment(<ml_model_aft_survival_regression>) glance(<ml_model_aft_survival_regression>) Tidying methods for Spark ML Survival Regression
tidy(<ml_model_decision_tree_classification>) tidy(<ml_model_decision_tree_regression>) augment(<ml_model_decision_tree_classification>) augment(<ml_model_decision_tree_regression>) glance(<ml_model_decision_tree_classification>) glance(<ml_model_decision_tree_regression>) tidy(<ml_model_random_forest_classification>) tidy(<ml_model_random_forest_regression>) augment(<ml_model_random_forest_classification>) augment(<ml_model_random_forest_regression>) glance(<ml_model_random_forest_classification>) glance(<ml_model_random_forest_regression>) tidy(<ml_model_gbt_classification>) tidy(<ml_model_gbt_regression>) augment(<ml_model_gbt_classification>) augment(<ml_model_gbt_regression>) glance(<ml_model_gbt_classification>) glance(<ml_model_gbt_regression>) Tidying methods for Spark ML tree models
tidy(<ml_model_kmeans>) augment(<ml_model_kmeans>) glance(<ml_model_kmeans>) tidy(<ml_model_bisecting_kmeans>) augment(<ml_model_bisecting_kmeans>) glance(<ml_model_bisecting_kmeans>) tidy(<ml_model_gaussian_mixture>) augment(<ml_model_gaussian_mixture>) glance(<ml_model_gaussian_mixture>) Tidying methods for Spark ML unsupervised models
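
The tidying methods above follow the broom conventions. This sketch assumes a local Spark installation and that the broom package is installed, so that its tidy() and glance() generics can be called explicitly; the model and data are illustrative.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

# One row per model term, with its estimate and related statistics.
coefs <- broom::tidy(fit)
print(coefs)

# A one-row summary of overall model fit.
fit_summary <- broom::glance(fit)
print(fit_summary)

spark_disconnect(sc)
```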


Extensions

Function(s) Description
compile_package_jars() Compile Scala sources into a Java Archive (jar)
connection_config() Read configuration values for a connection
download_scalac() Downloads default Scala Compilers
find_scalac() Discover the Scala Compiler
spark_context() java_context() hive_context() spark_session() Access the Spark API
hive_context_config() Runtime configuration interface for Hive
invoke() invoke_static() invoke_new() Invoke a Method on a JVM Object
j_invoke() j_invoke_static() j_invoke_new() Invoke a Java function.
jarray() Instantiate a Java array with a specific element type.
jfloat() Instantiate a Java float type.
jfloat_array() Instantiate an Array[Float].
register_extension() registered_extensions() Register a Package that Implements a Spark Extension
spark_compilation_spec() Define a Spark Compilation Specification
spark_default_compilation_spec() Default Compilation Specification for Spark Extensions
spark_context_config() Runtime configuration interface for the Spark Context.
spark_dataframe() Retrieve a Spark DataFrame
spark_dependency() Define a Spark dependency
spark_home_set() Set the SPARK_HOME environment variable
spark_jobj() Retrieve a Spark JVM Object Reference
spark_version() Get the Spark Version Associated with a Spark Connection
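
The invoke family above calls methods on JVM objects through the Spark connection. A minimal sketch, assuming a local Spark installation; the Java classes used are standard JDK classes picked for illustration.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Call a static JVM method: java.lang.Math.hypot(3, 4).
hyp <- invoke_static(sc, "java.lang.Math", "hypot", 3, 4)
print(hyp)

# Construct a JVM object, then call an instance method on it.
big <- invoke_new(sc, "java.math.BigInteger", "1000000000")
print(invoke(big, "longValue"))

# Ask the underlying SparkContext for its version string.
ctx_version <- invoke(spark_context(sc), "version")
print(ctx_version)

spark_disconnect(sc)
```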

Distributed Computing

Function(s) Description
spark_apply() Apply an R Function in Spark
spark_apply_bundle() Create Bundle for Spark Apply
spark_apply_log() Log Writer for Spark Apply
registerDoSpark() Register a Parallel Backend
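
spark_apply() runs an R function over each partition of a Spark DataFrame. The sketch below assumes a local Spark installation with R available to the Spark workers, which is the case in local mode; the multiplier is arbitrary.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# A Spark DataFrame with a single `id` column of 10 rows.
ids <- sdf_len(sc, 10)

# The function receives each partition as an R data frame
# and must return a data frame.
scaled <- spark_apply(ids, function(df) df * 10)

# Collect the result into R to inspect it.
result <- sdf_collect(scaled)
print(result)

spark_disconnect(sc)
```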


Livy

Function(s) Description
livy_install() livy_available_versions() livy_install_dir() livy_installed_versions() livy_home_dir() Install Livy
livy_config() Create a Spark Configuration for Livy
livy_service_start() livy_service_stop() Start and stop the Livy service


Streaming

Function(s) Description
stream_find() Find Stream
stream_generate_test() Generate Test Stream
stream_id() Spark Stream’s Identifier
stream_lag() Apply lag function to columns of a Spark Streaming DataFrame
stream_name() Spark Stream’s Name
stream_read_csv() Read CSV Stream
stream_read_json() Read JSON Stream
stream_read_delta() Read Delta Stream
stream_read_kafka() Read Kafka Stream
stream_read_orc() Read ORC Stream
stream_read_parquet() Read Parquet Stream
stream_read_socket() Read Socket Stream
stream_read_text() Read Text Stream
stream_render() Render Stream
stream_stats() Stream Statistics
stream_stop() Stops a Spark Stream
stream_trigger_continuous() Spark Stream Continuous Trigger
stream_trigger_interval() Spark Stream Interval Trigger
stream_view() View Stream
stream_watermark() Watermark Stream
stream_write_console() Write Console Stream
stream_write_csv() Write CSV Stream
stream_write_delta() Write Delta Stream
stream_write_json() Write JSON Stream
stream_write_kafka() Write Kafka Stream
stream_write_memory() Write Memory Stream
stream_write_orc() Write ORC Stream
stream_write_parquet() Write Parquet Stream
stream_write_text() Write Text Stream
reactiveSpark() Reactive spark reader
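
A stream reads from a source folder and writes to a sink until it is stopped. The sketch below wires stream_read_csv() to stream_write_csv(); it assumes a local Spark installation, and the folder names and the five-second wait are hypothetical choices for this example.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Hypothetical input and output folders inside the session's temporary directory.
in_dir  <- file.path(tempdir(), "stream-in")
out_dir <- file.path(tempdir(), "stream-out")
dir.create(in_dir, showWarnings = FALSE)

# Seed the input folder with one CSV file for the stream to pick up.
write.csv(mtcars, file.path(in_dir, "data.csv"), row.names = FALSE)

# Read the folder as a stream and write incoming rows to the output folder.
input  <- stream_read_csv(sc, in_dir)
stream <- stream_write_csv(input, out_dir)

# Give the stream a moment to process the file, then stop it.
Sys.sleep(5)
stream_stop(stream)

spark_disconnect(sc)
```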