Merge pull request #287 from rstudio/sparklyr-html
Quarto/HTML cheatsheet for sparklyr
mine-cetinkaya-rundel authored May 31, 2024
2 parents 4fbbb36 + e2ee7e9 commit 483c5f7
Showing 7 changed files with 857 additions and 8 deletions.
8 changes: 3 additions & 5 deletions _freeze/html/sparklyr/execute-results/html.json
@@ -1,11 +1,9 @@
{
"hash": "dfa5d20431126ba509f25c2253692592",
"hash": "8ebe6bf2cb68f3277c06277a66ff6e8c",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Data science in Spark with sparklyr :: Cheatsheet\"\ndescription: \" \"\nimage-alt: \"\"\nexecute:\n eval: true\n output: false\n warning: false\n---\n\n::: {.cell .column-margin}\n<img src=\"images/logo-sparklyr.png\" height=\"138\" alt=\"Hex logo for sparklyr - Neon shooting stars of various shapes and sizes flying across a black and grey background.\" />\n<br><br><a href=\"../sparklyr.pdf\">\n<p><i class=\"bi bi-file-pdf\"></i> Download PDF</p>\n<img src=\"../pngs/sparklyr.png\" width=\"200\" alt=\"\"/>\n</a>\n<br><br><p>Translations (PDF)</p>\n* <a href=\"../translations/chinese/sparklyr_zh_cn.pdf\"><i class=\"bi bi-file-pdf\"></i>Chinese</a>\n* <a href=\"../translations/chinese/sparklyr_zh_tw.pdf\"><i class=\"bi bi-file-pdf\"></i>Chinese</a>\n* <a href=\"../translations/german/sparklyr_de.pdf\"><i class=\"bi bi-file-pdf\"></i>German</a>\n* <a href=\"../translations/japanese/sparklyr_ja.pdf\"><i class=\"bi bi-file-pdf\"></i>Japanese</a>\n* <a href=\"../translations/spanish/sparklyr_es.pdf\"><i class=\"bi bi-file-pdf\"></i>Spanish</a>\n:::\n\n\n\n\n<!-- Page 1 -->\n\nHTML version coming soon!\nThe PDF is available to download [here](../sparklyr.pdf).\n",
"supporting": [
"sparklyr_files"
],
"markdown": "---\ntitle: \"Data science in Spark with sparklyr :: Cheatsheet\"\ndescription: \" \"\nimage-alt: \"\"\nexecute:\n eval: false\n output: false\n warning: false\n---\n\n::: {.cell .column-margin}\n\n:::\n\n\n\n<!-- Page 1 -->\n\n![](images/sparklyr-ds-workflow.png)\n\n# Connect\n\n## Databricks Connect (v2)\n\nSupported in Databricks Connect v2\n\n1. Open your .Renviron file: `usethis::edit_r_environ()`\n\n2. In the .Renviron file add your Databricks Host Url and Token (PAT):\n\n - `DATABRICKS_HOST = \\[Your Host URL\\]`\n - `DATABRICKS_TOKEN = \\[Your PAT\\]`\n\n3. Install extension: `install.packages(\"pysparklyr\")`\n\n4. Open connection:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(\n cluster_id = \"[Your cluster’s ID]\",\n method = \"databricks_connect\"\n)\n```\n:::\n\n\n\n## Standalone cluster\n\n1. Install RStudio Server on one of the existing nodes or a server in the same LAN\n\n2. Open a connection\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nspark_connect(\n master=\"spark://host:port\",\n version = \"3.2\",\n spark_home = [path to Spark]\n)\n```\n:::\n\n\n\n## Yarn client\n\n1. Install RStudio Server on an edge node\n\n2. Locate path to the clusterʼs Spark Home Directory, it normally is `\"/usr/lib/spark\"`\n\n3. Basic configuration example\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nconf <- spark_config()\nconf$spark.executor.memory <- \"300M\"\nconf$spark.executor.cores <- 2\nconf$spark.executor.instances <- 3\nconf$spark.dynamicAllocation.enabled<-\"false\"\n```\n:::\n\n\n\n4. Open a connection\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(\n master = \"yarn\",\n spark_home = \"/usr/lib/spark/\",\n version = \"2.1.0\", config = conf\n)\n```\n:::\n\n\n\n## Yarn cluster\n\n1. Make sure to have copies of the **yarn-site.xml** and **hive-site.xml** files in the RStudio Server\n\n2. 
Point environment variables to the correct paths\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nSys.setenv(JAVA_HOME=\"[Path]\")\nSys.setenv(SPARK_HOME =\"[Path]\")\nSys.setenv(YARN_CONF_DIR =\"[Path]\")\n```\n:::\n\n\n\n3. Open a connection\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(master = \"yarn-cluster\")\n```\n:::\n\n\n\n## Kubernetes\n\n1. Use the following to obtain the Host and Port `system2(\"kubectl\", \"cluster-info\")`\n\n2. Open a connection\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(\n config = spark_config_kubernetes(\n \"k8s://https://[HOST]>:[PORT]\",\n account = \"default\",\n image = \"docker.io/owner/repo:version\"\n )\n )\n```\n:::\n\n\n\n## Local mode\n\nNo cluster required.\nUse for learning purposes only\n\n1. Install a local version of Spark: spark_install()\n\n2. Open a connection\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(master=\"local\") \n```\n:::\n\n\n\n## Cloud\n\n**Azure** - `spark_connect(method = \"synapse\")`\n\n**Qubole** - `spark_connect(method = \"qubole\")`\n\n# Import\n\n![](images/sparklyr-import.png){fig-alt=\"R push complete to Spark back to R collecting results. 
Also import from Source to Spark.\" fig-align=\"center\" width=\"501\"}\n\nImport data into Spark, not R\n\n## Read a file into Spark\n\n**Arguments that apply to all functions:**\n\nsc, name, path, options=list(), repartition=0, memory=TRUE, overwrite=TRUE\n\n- CSV: `spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = \",\", quote= \"\\\"\", escape = \"\\\\\", charset = \"UTF-8\", null_value = NULL)`\n- JSON: `spark_read_json()`\n- PARQUET: `spark_read_parquet()`\n- TEXT: `spark_read_text()`\n- DELTA: `spark_read_delta()`\n\n## From a table\n\n- `dplyr::tbl(scr, ...)` - Creates a reference to the table without loading its data into memory\n\n- `dbplyr::in_catalog()` - Enables a three part table address\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- tbl(sc,in_catalog(\"catalog\", \"schema\", \"table\"))\n```\n:::\n\n\n\n## R data frame into Spark\n\nSupported in Databricks Connect v2\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndplyr::copy_to(dest, df, name)\n```\n:::\n\n\n\nApache Arrow accelerates data transfer between R and Spark.\nTo use, simply load the library\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(sparklyr)\nlibrary(arrow)\n```\n:::\n\n\n\n# Wrangle\n\n## dplyr verbs\n\nSupported in Databricks Connect v2\n\nTranslates into Spark SQL statements\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncopy_to(sc, mtcars) %>%\n mutate(trm = ifelse(am == 0, \"auto\", \"man\")) %>%\n group_by(trm) %>%\n summarise_all(mean)\n```\n:::\n\n\n\n## tidyr\n\n- `pivot_longer()` - Collapse several columns into two.\n (Supported in Databricks Connect v2)\n\n- `pivot_wider()` - Expand two columns into several.\n (Supported in Databricks Connect v2)\n\n- `nest()` / `unnest()` - Convert groups of cells into list-columns, and vice versa.\n\n- `unite()` / `separate()` - Split a single column into several columns, and vice versa.\n\n- `fill()` - Fill NA with the previous value.\n\n## Feature transformers\n\n- `ft_binarizer()` - Assigned values based on 
threshold\n- `ft_bucketizer()` - Numeric column to discretized column\n- `ft_count_vectorizer()` - Extracts a vocabulary from document\n- `ft_discrete_cosine_transform()` - 1D discrete cosine transform of a real vector\n- `ft_elementwise_product()` - Element- wise product between 2 cols\n- `ft_hashing_tf()` - Maps a sequence of terms to their term frequencies using the hashing trick.\n- `ft_idf()` - Compute the Inverse Document Frequency (IDF) given a collection of documents.\n- `ft_imputer()` - Imputation estimator for completing missing values, uses the mean or the median of the columns.\n- `ft_index_to_string()` - Index labels back to label as strings\n- `ft_interaction()` - Takes in Double and Vector columns and outputs a flattened vector of their feature interactions.\n- `ft_max_abs_scaler()` - Rescale each feature individually to range \\[-1, 1\\] (Supported in Databricks Connect v2)\n- `ft_min_max_scaler()` - Rescale each feature to a common range \\[min, max\\] linearly\n- `ft_ngram()` - Converts the input array of strings into an array of n-grams\n- `ft_bucketed_random_projection_lsh()`\n- `ft_minhash_lsh()` - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)\n- `ft_normalizer()` - Normalize a vector to have unit norm using the given p-norm\n- `ft_one_hot_encoder()` - Continuous to binary vectors\n- `ft_pca()` - Project vectors to a lower dimensional space of top k principal components\n- `ft_quantile_discretizer()` - Continuous to binned categorical values.\n- `ft_regex_tokenizer()` - Extracts tokens either by using the provided regex pattern to split the text\n- `ft_robust_scaler()` - Removes the median and scales according to standard scale\n- `ft_standard_scaler()` - Removes the mean and scaling to unit variance using column summary statistics (Supported in Databricks Connect v2)\n- `ft_stop_words_remover()` - Filters out stop words from input\n- `ft_string_indexer()` - Column of labels into a column of label 
indices.\n- `ft_tokenizer()` - Converts to lowercase and then splits it by white spaces\n- `ft_vector_assembler()` - Combine vectors into single row-vector\n- `ft_vector_indexer()` - Indexing categorical feature columns in a dataset of Vector\n- `ft_vector_slicer()` - Takes a feature vector and outputs a new feature vector with a subarray of the original features\n- `ft_word2vec()` - Word2Vec transforms a word into a code\n\n<!-- Page 2 -->\n\n# Visualize\n\n## dplyr + ggplot2\n\nSupported in Databricks Connect v2\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncopy_to(sc, mtcars) %>%\n group_by(cyl) %>%\n summarise(mpg_m = mean(mpg)) %>% # Summarize in Spark\n collect() %>% # Collect results in R\n ggplot() +\n geom_col(aes(cyl, mpg_m)) # Create plot\n```\n:::\n\n\n\n# Modeling\n\n## Regression\n\n- `ml_linear_regression()` - Linear regression.\n- `ml_aft_survival_regression()` - Parametric survival regression model named accelerated failure time (AFT) model.\n- `ml_generalized_linear_regression()` - GLM.\n- `ml_isotonic_regression()` - Uses parallelized pool adjacent violators algorithm.\n- `ml_random_forest_regressor()` - Regression using random forests.\n\n## Classification\n\n- `ml_linear_svc()` - Classification using linear support vector machines.\n- `ml_logistic_regression()` - Logistic regression. (Supported in Databricks Connect v2)\n- `ml_multilayer_perceptron_classifier()` - Based on the Multilayer Perceptron.\n- `ml_naive_bayes()` - It supports Multinomial NB which can handle finitely supported discrete data.\n- `ml_one_vs_rest()` - Reduction of Multiclass, performs reduction using one against all strategy.\n\n## Tree\n\n- `ml_decision_tree_classifier()`, `ml_decision_tree()`, \\`ml_decision_tree_regressor(). 
- Classification and regression using decision trees.\n- `ml_gbt_classifier()`, `ml_gradient_boosted_trees()`, `ml_gbt_regressor()` - Binary classification and regression using gradient boosted trees.\n- `ml_random_forest_classifier()` - Classification and regression using random forests.\n- `ml_feature_importances()`, `ml_tree_feature_importance()` - Feature Importance for Tree Models.\n\n## Clustering\n\n- `ml_bisecting_kmeans()` - A bisecting k-means algorithm based on the paper.\n- `ml_lda()`, `ml_describe_topics()`, `ml_log_likelihood()`, `ml_log_perplexity()`, `ml_topics_matrix()` - LDA topic model designed for text documents.\n- `ml_gaussian_mixture()` - Expectation maximization for multivariate Gaussian Mixture Models (GMMs).\n- `ml_kmeans()`, `ml_compute_cost()`, `ml_compute_silhouette_measure()` - Clustering with support for k-means.\n- `ml_power_iteration()` - For clustering vertices of a graph given pairwise similarities as edge properties.\n\n## Recommendation\n\n- `ml_als()`, `ml_recommend()` - Recommendation using Alternating Least Squares matrix factorization.\n\n## Evaluation\n\n- `ml_clustering_evaluator()` - Evaluator for clustering.\n- `ml_evaluate()` - Compute performance metrics.\n- `ml_binary_classification_evaluator()`, `ml_binary_classification_eval()`, `ml_classification_eval()` - A set of functions to calculate performance metrics for prediction models.\n\n## Frequent pattern\n\n- `ml_fpgrowth()`, `ml_association_rules()`, `ml_freq_itemsets()` - A parallel FP-growth algorithm to mine frequent itemsets.\n- `ml_freq_seq_patterns()`, `ml_prefixspan()` - PrefixSpan algorithm for mining frequent itemsets.\n\n## Stats\n\n- `ml_summary()` - Extracts a metric from the summary object of a Spark ML model.\n- `ml_corr()` - Compute correlation matrix.\n\n## Feature\n\n- `ml_chisquare_test(x,features,label)` - Pearson's independence test for every feature against the label.\n- `ml_default_stop_words()` - Loads the default stop words for the given 
language.\n\n## Utilities\n\n- `ml_call_constructor()` - Identifies the associated sparklyr ML constructor for the JVM.\n- `ml_model_data()` - Extracts data associated with a Spark ML model.\n- `ml_standardize_formula()` - Generates a formula string from user inputs.\n- `ml_uid()` - Extracts the UID of an ML object.\n\n# ML pipelines\n\nEasily create formal Spark Pipeline models using R.\nSave the Pipeline in native Scala.\nIt will have **no dependencies** on R.\n\n## Initialize and train\n\nSupported in Databricks Connect v2\n\n- `ml_pipeline()` - Initializes a new Spark Pipeline.\n- `ml_fit()` - Trains the model, outputs a Spark Pipeline Model.\n\n## Save and retrieve\n\nSupported in Databricks Connect v2\n\n- `ml_save()` - Saves into a format that can be read by Scala and PySpark.\n- `ml_read()` - Reads Spark object into sparklyr.\n\n![](images/sparklyr-save-retrieve.png){fig-alt=\"ml_pipeline() to ft_dplyr_transformer to ft_bucketizer() to ml_linear_regression() to ml_fit() to ml_save().\" fig-align=\"center\" width=\"501\"}\n\n[spark.posit.co/guides/pipelines](https://spark.posit.co/guides/pipelines)\n\n# Distributed R\n\nSupported in Databricks Connect v2\n\nRun arbitrary R code at scale inside your cluster with spark_apply().\nUseful when you need functionality only available in R, and to solve 'embarrassingly parallel problems'.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nspark_apply(\n x, f, \n columns = NULL, memory = TRUE, group_by = NULL, \n name = NULL, barrier = NULL, fetch_result_as_sdf = TRUE\n )\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncopy_to(sc, mtcars) %>%\n spark_apply(\n nrow, # R only function\n group_by = \"am\", \n columns = \"am double, x long\"\n )\n```\n:::\n\n\n\n------------------------------------------------------------------------\n\nCC BY SA Posit Software, PBC • [info\\@posit.co](mailto:info@posit.co) • [posit.co](https://posit.co)\n\nLearn more at [spark.posit.co](https://spark.posit.co/) and 
[therinspark.com](https://therinspark.com/).\n\nUpdated: 2024-05.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npackageVersion(\"sparklyr\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] '1.8.6'\n```\n\n\n:::\n:::\n\n\n\n------------------------------------------------------------------------\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],