Commit: Merge pull request #287 from rstudio/sparklyr-html

Quarto/HTML cheatsheet for sparklyr
Showing 7 changed files with 857 additions and 8 deletions.
This commit updates the stored Quarto execute result for the sparklyr cheatsheet page (`engine: knitr`). The result hash changes from `dfa5d20431126ba509f25c2253692592` to `8ebe6bf2cb68f3277c06277a66ff6e8c`. The previous markdown output was a placeholder: a column-margin block with the sparklyr hex logo, a PDF download link, and translation links (Chinese simplified and traditional, German, Japanese, and Spanish), followed by "HTML version coming soon! The PDF is available to download [here](../sparklyr.pdf)." That version listed `sparklyr_files` under `supporting`. The new result replaces it with the full HTML cheatsheet content below.
The new markdown output opens with front matter titled "Data science in Spark with sparklyr :: Cheatsheet" (execute options `eval: false`, `output: false`, `warning: false`) and an empty column-margin block, then continues:

<!-- Page 1 -->

![](images/sparklyr-ds-workflow.png)

# Connect

## Databricks Connect (v2)

Supported in Databricks Connect v2.

1. Open your .Renviron file: `usethis::edit_r_environ()`

2. In the .Renviron file, add your Databricks Host URL and Token (PAT):

   - `DATABRICKS_HOST = [Your Host URL]`
   - `DATABRICKS_TOKEN = [Your PAT]`

3. Install the extension: `install.packages("pysparklyr")`

4. Open the connection:

```r
sc <- spark_connect(
  cluster_id = "[Your cluster's ID]",
  method = "databricks_connect"
)
```

## Standalone cluster

1. Install RStudio Server on one of the existing nodes, or on a server in the same LAN.

2. Open a connection:

```r
spark_connect(
  master = "spark://host:port",
  version = "3.2",
  spark_home = "[path to Spark]"
)
```

## Yarn client

1. Install RStudio Server on an edge node.

2. Locate the path to the cluster's Spark home directory; it is normally `/usr/lib/spark`.

3. Basic configuration example:

```r
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
```

4. Open a connection:

```r
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  version = "2.1.0", config = conf
)
```

## Yarn cluster

1. Make sure copies of the **yarn-site.xml** and **hive-site.xml** files are on the RStudio Server.

2. Point environment variables to the correct paths:

```r
Sys.setenv(JAVA_HOME = "[Path]")
Sys.setenv(SPARK_HOME = "[Path]")
Sys.setenv(YARN_CONF_DIR = "[Path]")
```

3. Open a connection:

```r
sc <- spark_connect(master = "yarn-cluster")
```

## Kubernetes

1. Use the following to obtain the Host and Port: `system2("kubectl", "cluster-info")`

2. Open a connection:

```r
sc <- spark_connect(
  config = spark_config_kubernetes(
    "k8s://https://[HOST]:[PORT]",
    account = "default",
    image = "docker.io/owner/repo:version"
  )
)
```

## Local mode

No cluster required.
Use for learning purposes only.

1. Install a local version of Spark: `spark_install()`

2. Open a connection:

```r
sc <- spark_connect(master = "local")
```

## Cloud

**Azure** - `spark_connect(method = "synapse")`

**Qubole** - `spark_connect(method = "qubole")`
# Import

![](images/sparklyr-import.png){fig-alt="R pushes compute to Spark and collects results back into R. Also import from Source to Spark." fig-align="center" width="501"}

Import data into Spark, not R.

## Read a file into Spark

**Arguments that apply to all functions:**

`sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE`

- CSV: `spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)`
- JSON: `spark_read_json()`
- PARQUET: `spark_read_parquet()`
- TEXT: `spark_read_text()`
- DELTA: `spark_read_delta()`

## From a table

- `dplyr::tbl(sc, ...)` - Creates a reference to the table without loading its data into memory.

- `dbplyr::in_catalog()` - Enables a three-part table address:

```r
x <- tbl(sc, in_catalog("catalog", "schema", "table"))
```

## R data frame into Spark

Supported in Databricks Connect v2.

```r
dplyr::copy_to(dest, df, name)
```

Apache Arrow accelerates data transfer between R and Spark.
To use it, simply load the library:

```r
library(sparklyr)
library(arrow)
```

# Wrangle

## dplyr verbs

Supported in Databricks Connect v2.

Translates into Spark SQL statements:

```r
copy_to(sc, mtcars) %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%
  group_by(trm) %>%
  summarise_all(mean)
```

## tidyr

- `pivot_longer()` - Collapse several columns into two.
  (Supported in Databricks Connect v2)

- `pivot_wider()` - Expand two columns into several.
  (Supported in Databricks Connect v2)

- `nest()` / `unnest()` - Convert groups of cells into list-columns, and vice versa.

- `unite()` / `separate()` - Split a single column into several columns, and vice versa.

- `fill()` - Fill NA with the previous value.

## Feature transformers

- `ft_binarizer()` - Assigns values based on a threshold.
- `ft_bucketizer()` - Numeric column to discretized column.
- `ft_count_vectorizer()` - Extracts a vocabulary from documents.
- `ft_discrete_cosine_transform()` - 1D discrete cosine transform of a real vector.
- `ft_elementwise_product()` - Element-wise product between 2 columns.
- `ft_hashing_tf()` - Maps a sequence of terms to their term frequencies using the hashing trick.
- `ft_idf()` - Compute the Inverse Document Frequency (IDF) given a collection of documents.
- `ft_imputer()` - Imputation estimator for completing missing values; uses the mean or the median of the columns.
- `ft_index_to_string()` - Index labels back to labels as strings.
- `ft_interaction()` - Takes in Double and Vector columns and outputs a flattened vector of their feature interactions.
- `ft_max_abs_scaler()` - Rescale each feature individually to the range [-1, 1]. (Supported in Databricks Connect v2)
- `ft_min_max_scaler()` - Rescale each feature to a common range [min, max] linearly.
- `ft_ngram()` - Converts the input array of strings into an array of n-grams.
- `ft_bucketed_random_projection_lsh()`, `ft_minhash_lsh()` - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash).
- `ft_normalizer()` - Normalize a vector to have unit norm using the given p-norm.
- `ft_one_hot_encoder()` - Maps a column of category indices to binary vectors.
- `ft_pca()` - Project vectors to a lower dimensional space of top k principal components.
- `ft_quantile_discretizer()` - Continuous to binned categorical values.
- `ft_regex_tokenizer()` - Extracts tokens by using the provided regex pattern to split the text.
- `ft_robust_scaler()` - Removes the median and scales according to the standard scale.
- `ft_standard_scaler()` - Removes the mean and scales to unit variance using column summary statistics. (Supported in Databricks Connect v2)
- `ft_stop_words_remover()` - Filters out stop words from input.
- `ft_string_indexer()` - Column of labels into a column of label indices.
- `ft_tokenizer()` - Converts text to lowercase and then splits it by white space.
- `ft_vector_assembler()` - Combine vectors into a single row-vector.
- `ft_vector_indexer()` - Indexing categorical feature columns in a dataset of Vector.
- `ft_vector_slicer()` - Takes a feature vector and outputs a new feature vector with a subarray of the original features.
- `ft_word2vec()` - Word2Vec transforms a word into a code.
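Feature transformers compose naturally with the dplyr verbs above. A brief sketch, assuming `sc` is an open connection from the Connect section; the bucket cut points and output column names (`hp_bucket`, `features`, `features_scaled`) are illustrative, while the functions and arguments are sparklyr's.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

cars %>%
  # bin horsepower into discrete buckets (cut points chosen for illustration)
  ft_bucketizer("hp", "hp_bucket", splits = c(0, 100, 200, 400)) %>%
  # assemble two numeric columns into a single vector column
  ft_vector_assembler(c("mpg", "wt"), "features") %>%
  # scale the assembled vector to unit variance
  ft_standard_scaler("features", "features_scaled") %>%
  select(hp, hp_bucket, features_scaled)
```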
<!-- Page 2 -->

# Visualize

## dplyr + ggplot2

Supported in Databricks Connect v2.

```r
copy_to(sc, mtcars) %>%
  group_by(cyl) %>%
  summarise(mpg_m = mean(mpg)) %>% # Summarize in Spark
  collect() %>% # Collect results in R
  ggplot() +
  geom_col(aes(cyl, mpg_m)) # Create plot
```

# Modeling

## Regression

- `ml_linear_regression()` - Linear regression.
- `ml_aft_survival_regression()` - Parametric survival regression model named accelerated failure time (AFT) model.
- `ml_generalized_linear_regression()` - GLM.
- `ml_isotonic_regression()` - Uses a parallelized pool adjacent violators algorithm.
- `ml_random_forest_regressor()` - Regression using random forests.

## Classification

- `ml_linear_svc()` - Classification using linear support vector machines.
- `ml_logistic_regression()` - Logistic regression. (Supported in Databricks Connect v2)
- `ml_multilayer_perceptron_classifier()` - Based on the multilayer perceptron.
- `ml_naive_bayes()` - Naive Bayes; supports Multinomial NB, which can handle finitely supported discrete data.
- `ml_one_vs_rest()` - Reduction of multiclass classification, performed using a one-against-all strategy.

## Tree

- `ml_decision_tree_classifier()`, `ml_decision_tree()`, `ml_decision_tree_regressor()` - Classification and regression using decision trees.
- `ml_gbt_classifier()`, `ml_gradient_boosted_trees()`, `ml_gbt_regressor()` - Binary classification and regression using gradient boosted trees.
- `ml_random_forest_classifier()` - Classification and regression using random forests.
- `ml_feature_importances()`, `ml_tree_feature_importance()` - Feature importance for tree models.

## Clustering

- `ml_bisecting_kmeans()` - A bisecting k-means algorithm.
- `ml_lda()`, `ml_describe_topics()`, `ml_log_likelihood()`, `ml_log_perplexity()`, `ml_topics_matrix()` - LDA topic model designed for text documents.
- `ml_gaussian_mixture()` - Expectation maximization for multivariate Gaussian Mixture Models (GMMs).
- `ml_kmeans()`, `ml_compute_cost()`, `ml_compute_silhouette_measure()` - Clustering with support for k-means.
- `ml_power_iteration()` - For clustering vertices of a graph given pairwise similarities as edge properties.

## Recommendation

- `ml_als()`, `ml_recommend()` - Recommendation using Alternating Least Squares matrix factorization.

## Evaluation

- `ml_clustering_evaluator()` - Evaluator for clustering.
- `ml_evaluate()` - Compute performance metrics.
- `ml_binary_classification_evaluator()`, `ml_binary_classification_eval()`, `ml_classification_eval()` - A set of functions to calculate performance metrics for prediction models.

## Frequent pattern

- `ml_fpgrowth()`, `ml_association_rules()`, `ml_freq_itemsets()` - A parallel FP-growth algorithm to mine frequent itemsets.
- `ml_freq_seq_patterns()`, `ml_prefixspan()` - PrefixSpan algorithm for mining frequent sequential patterns.

## Stats

- `ml_summary()` - Extracts a metric from the summary object of a Spark ML model.
- `ml_corr()` - Compute correlation matrix.

## Feature

- `ml_chisquare_test(x, features, label)` - Pearson's independence test for every feature against the label.
- `ml_default_stop_words()` - Loads the default stop words for the given language.

## Utilities

- `ml_call_constructor()` - Identifies the associated sparklyr ML constructor for the JVM.
- `ml_model_data()` - Extracts data associated with a Spark ML model.
- `ml_standardize_formula()` - Generates a formula string from user inputs.
- `ml_uid()` - Extracts the UID of an ML object.
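To show how the modeling functions above fit together, here is a small sketch that splits data, fits a logistic regression, and scores the held-out rows. It assumes `sc` is an open connection; the formula, split proportions, and seed are arbitrary choices, while the functions (`sdf_random_split()`, `ml_logistic_regression()`, `ml_predict()`, `ml_evaluate()`) are from sparklyr.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

# split the data inside Spark (70/30 is an arbitrary choice for this sketch)
splits <- sdf_random_split(cars, training = 0.7, test = 0.3, seed = 42)

# model transmission type from fuel efficiency, weight, and horsepower
fit <- ml_logistic_regression(splits$training, am ~ mpg + wt + hp)

ml_predict(fit, splits$test)   # score the held-out rows in Spark
ml_evaluate(fit, splits$test)  # compute performance metrics on the test split
```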
# ML pipelines

Easily create formal Spark pipeline models using R.
Save the pipeline in native Scala; it will have **no dependencies** on R.

## Initialize and train

Supported in Databricks Connect v2.

- `ml_pipeline()` - Initializes a new Spark pipeline.
- `ml_fit()` - Trains the model and outputs a Spark pipeline model.

## Save and retrieve

Supported in Databricks Connect v2.

- `ml_save()` - Saves into a format that can be read by Scala and PySpark.
- `ml_read()` - Reads a Spark object into sparklyr.

![](images/sparklyr-save-retrieve.png){fig-alt="ml_pipeline() to ft_dplyr_transformer() to ft_bucketizer() to ml_linear_regression() to ml_fit() to ml_save()." fig-align="center" width="501"}

[spark.posit.co/guides/pipelines](https://spark.posit.co/guides/pipelines)
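A minimal pipeline sketch along the lines of the diagram above; it assumes `sc` is an open connection, and the dplyr step, bucket cut points, formula, and save path are all placeholders rather than part of the original cheatsheet.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

pipeline <- ml_pipeline(sc) %>%
  # capture a dplyr step as a pipeline stage
  ft_dplyr_transformer(cars %>% mutate(wt_ton = wt / 2)) %>%
  # bin horsepower (cut points chosen for illustration)
  ft_bucketizer("hp", "hp_bucket", splits = c(0, 100, 200, 400)) %>%
  # build features/label columns from a formula, then add the model stage
  ft_r_formula(mpg ~ wt_ton + hp_bucket) %>%
  ml_linear_regression()

model <- ml_fit(pipeline, cars)                        # train the whole pipeline
ml_save(model, "cars_pipeline_model", overwrite = TRUE) # path is a placeholder
```

Because the fitted pipeline is saved in Spark's native format, it can be reloaded and scored without R, which is the point of the "no dependencies on R" note above.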
# Distributed R

Supported in Databricks Connect v2.

Run arbitrary R code at scale inside your cluster with `spark_apply()`.
Useful when you need functionality that is only available in R, and to solve "embarrassingly parallel" problems.

```r
spark_apply(
  x, f,
  columns = NULL, memory = TRUE, group_by = NULL,
  name = NULL, barrier = NULL, fetch_result_as_sdf = TRUE
)
```

```r
copy_to(sc, mtcars) %>%
  spark_apply(
    nrow, # R-only function
    group_by = "am",
    columns = "am double, x long"
  )
```

------------------------------------------------------------------------

CC BY SA Posit Software, PBC • [info@posit.co](mailto:info@posit.co) • [posit.co](https://posit.co)

Learn more at [spark.posit.co](https://spark.posit.co/) and [therinspark.com](https://therinspark.com/).

Updated: 2024-05.

```r
packageVersion("sparklyr")
```

```
[1] '1.8.6'
```

------------------------------------------------------------------------

In the updated execute result, the `supporting` array is now empty, and the `filters` entry still lists `rmarkdown/pagebreak.lua`.