Commit: Merge pull request #287 from rstudio/sparklyr-html

Quarto/HTML cheatsheet for sparklyr
Showing 7 changed files with 857 additions and 8 deletions.
This commit updates the stored Quarto execute result for the sparklyr cheatsheet page (`engine: knitr`). The result hash changes from `dfa5d20431126ba509f25c2253692592` to `8ebe6bf2cb68f3277c06277a66ff6e8c`. The previous markdown output was a placeholder: a column-margin block with the sparklyr hex logo, a PDF download link, and translation links (Chinese simplified and traditional, German, Japanese, and Spanish), followed by "HTML version coming soon! The PDF is available to download [here](../sparklyr.pdf)." That version listed `sparklyr_files` under `supporting`. The new result replaces it with the full HTML cheatsheet content below.
The new markdown output opens with front matter titled "Data science in Spark with sparklyr :: Cheatsheet" (execute options `eval: false`, `output: false`, `warning: false`) and an empty column-margin block, then continues:

<!-- Page 1 -->

![](images/sparklyr-ds-workflow.png)

# Connect

## Databricks Connect (v2)

Supported in Databricks Connect v2.

1. Open your .Renviron file: `usethis::edit_r_environ()`

2. In the .Renviron file, add your Databricks Host URL and Token (PAT):

   - `DATABRICKS_HOST = [Your Host URL]`
   - `DATABRICKS_TOKEN = [Your PAT]`

3. Install the extension: `install.packages("pysparklyr")`

4. Open the connection:

```r
sc <- spark_connect(
  cluster_id = "[Your cluster's ID]",
  method = "databricks_connect"
)
```

## Standalone cluster

1. Install RStudio Server on one of the existing nodes, or on a server in the same LAN.

2. Open a connection:

```r
spark_connect(
  master = "spark://host:port",
  version = "3.2",
  spark_home = "[path to Spark]"
)
```

## Yarn client

1. Install RStudio Server on an edge node.

2. Locate the path to the cluster's Spark home directory; it is normally `/usr/lib/spark`.

3. Basic configuration example:

```r
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
```

4. Open a connection:

```r
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  version = "2.1.0", config = conf
)
```

## Yarn cluster

1. Make sure copies of the **yarn-site.xml** and **hive-site.xml** files are on the RStudio Server.

2. Point environment variables to the correct paths:

```r
Sys.setenv(JAVA_HOME = "[Path]")
Sys.setenv(SPARK_HOME = "[Path]")
Sys.setenv(YARN_CONF_DIR = "[Path]")
```

3. Open a connection:

```r
sc <- spark_connect(master = "yarn-cluster")
```

## Kubernetes

1. Use the following to obtain the Host and Port: `system2("kubectl", "cluster-info")`

2. Open a connection:

```r
sc <- spark_connect(
  config = spark_config_kubernetes(
    "k8s://https://[HOST]:[PORT]",
    account = "default",
    image = "docker.io/owner/repo:version"
  )
)
```

## Local mode

No cluster required.
Use for learning purposes only.

1. Install a local version of Spark: `spark_install()`

2. Open a connection:

```r
sc <- spark_connect(master = "local")
```

## Cloud

**Azure** - `spark_connect(method = "synapse")`

**Qubole** - `spark_connect(method = "qubole")`
# Import

![](images/sparklyr-import.png){fig-alt="R pushes compute to Spark and collects results back into R. Also import from Source to Spark." fig-align="center" width="501"}

Import data into Spark, not R.

## Read a file into Spark

**Arguments that apply to all functions:**

`sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE`

- CSV: `spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)`
- JSON: `spark_read_json()`
- PARQUET: `spark_read_parquet()`
- TEXT: `spark_read_text()`
- DELTA: `spark_read_delta()`

## From a table

- `dplyr::tbl(sc, ...)` - Creates a reference to the table without loading its data into memory.

- `dbplyr::in_catalog()` - Enables a three-part table address:

```r
x <- tbl(sc, in_catalog("catalog", "schema", "table"))
```

## R data frame into Spark

Supported in Databricks Connect v2.

```r
dplyr::copy_to(dest, df, name)
```

Apache Arrow accelerates data transfer between R and Spark.
To use it, simply load the library:

```r
library(sparklyr)
library(arrow)
```

# Wrangle

## dplyr verbs

Supported in Databricks Connect v2.

Translates into Spark SQL statements:

```r
copy_to(sc, mtcars) %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%
  group_by(trm) %>%
  summarise_all(mean)
```

## tidyr

- `pivot_longer()` - Collapse several columns into two.
  (Supported in Databricks Connect v2)

- `pivot_wider()` - Expand two columns into several.
  (Supported in Databricks Connect v2)

- `nest()` / `unnest()` - Convert groups of cells into list-columns, and vice versa.

- `unite()` / `separate()` - Split a single column into several columns, and vice versa.

- `fill()` - Fill NA with the previous value.

## Feature transformers

- `ft_binarizer()` - Assigns values based on a threshold.
- `ft_bucketizer()` - Numeric column to discretized column.
- `ft_count_vectorizer()` - Extracts a vocabulary from documents.
- `ft_discrete_cosine_transform()` - 1D discrete cosine transform of a real vector.
- `ft_elementwise_product()` - Element-wise product between 2 columns.
- `ft_hashing_tf()` - Maps a sequence of terms to their term frequencies using the hashing trick.
- `ft_idf()` - Compute the Inverse Document Frequency (IDF) given a collection of documents.
- `ft_imputer()` - Imputation estimator for completing missing values; uses the mean or the median of the columns.
- `ft_index_to_string()` - Index labels back to labels as strings.
- `ft_interaction()` - Takes in Double and Vector columns and outputs a flattened vector of their feature interactions.
- `ft_max_abs_scaler()` - Rescale each feature individually to the range [-1, 1]. (Supported in Databricks Connect v2)
- `ft_min_max_scaler()` - Rescale each feature to a common range [min, max] linearly.
- `ft_ngram()` - Converts the input array of strings into an array of n-grams.
- `ft_bucketed_random_projection_lsh()`, `ft_minhash_lsh()` - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash).
- `ft_normalizer()` - Normalize a vector to have unit norm using the given p-norm.
- `ft_one_hot_encoder()` - Maps a column of category indices to binary vectors.
- `ft_pca()` - Project vectors to a lower dimensional space of top k principal components.
- `ft_quantile_discretizer()` - Continuous to binned categorical values.
- `ft_regex_tokenizer()` - Extracts tokens by using the provided regex pattern to split the text.
- `ft_robust_scaler()` - Removes the median and scales according to the standard scale.
- `ft_standard_scaler()` - Removes the mean and scales to unit variance using column summary statistics. (Supported in Databricks Connect v2)
- `ft_stop_words_remover()` - Filters out stop words from input.
- `ft_string_indexer()` - Column of labels into a column of label indices.
- `ft_tokenizer()` - Converts text to lowercase and then splits it by white space.
- `ft_vector_assembler()` - Combine vectors into a single row-vector.
- `ft_vector_indexer()` - Indexing categorical feature columns in a dataset of Vector.
- `ft_vector_slicer()` - Takes a feature vector and outputs a new feature vector with a subarray of the original features.
- `ft_word2vec()` - Word2Vec transforms a word into a code.
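Feature transformers compose naturally with the dplyr verbs above. A brief sketch, assuming `sc` is an open connection from the Connect section; the bucket cut points and output column names (`hp_bucket`, `features`, `features_scaled`) are illustrative, while the functions and arguments are sparklyr's.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

cars %>%
  # bin horsepower into discrete buckets (cut points chosen for illustration)
  ft_bucketizer("hp", "hp_bucket", splits = c(0, 100, 200, 400)) %>%
  # assemble two numeric columns into a single vector column
  ft_vector_assembler(c("mpg", "wt"), "features") %>%
  # scale the assembled vector to unit variance
  ft_standard_scaler("features", "features_scaled") %>%
  select(hp, hp_bucket, features_scaled)
```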
<!-- Page 2 -->

# Visualize

## dplyr + ggplot2

Supported in Databricks Connect v2.

```r
copy_to(sc, mtcars) %>%
  group_by(cyl) %>%
  summarise(mpg_m = mean(mpg)) %>% # Summarize in Spark
  collect() %>% # Collect results in R
  ggplot() +
  geom_col(aes(cyl, mpg_m)) # Create plot
```

# Modeling

## Regression

- `ml_linear_regression()` - Linear regression.
- `ml_aft_survival_regression()` - Parametric survival regression model named accelerated failure time (AFT) model.
- `ml_generalized_linear_regression()` - GLM.
- `ml_isotonic_regression()` - Uses a parallelized pool adjacent violators algorithm.
- `ml_random_forest_regressor()` - Regression using random forests.

## Classification

- `ml_linear_svc()` - Classification using linear support vector machines.
- `ml_logistic_regression()` - Logistic regression. (Supported in Databricks Connect v2)
- `ml_multilayer_perceptron_classifier()` - Based on the multilayer perceptron.
- `ml_naive_bayes()` - Naive Bayes; supports Multinomial NB, which can handle finitely supported discrete data.
- `ml_one_vs_rest()` - Reduction of multiclass classification, performed using a one-against-all strategy.

## Tree

- `ml_decision_tree_classifier()`, `ml_decision_tree()`, `ml_decision_tree_regressor()` - Classification and regression using decision trees.
- `ml_gbt_classifier()`, `ml_gradient_boosted_trees()`, `ml_gbt_regressor()` - Binary classification and regression using gradient boosted trees.
- `ml_random_forest_classifier()` - Classification and regression using random forests.
- `ml_feature_importances()`, `ml_tree_feature_importance()` - Feature importance for tree models.

## Clustering

- `ml_bisecting_kmeans()` - A bisecting k-means algorithm.
- `ml_lda()`, `ml_describe_topics()`, `ml_log_likelihood()`, `ml_log_perplexity()`, `ml_topics_matrix()` - LDA topic model designed for text documents.
- `ml_gaussian_mixture()` - Expectation maximization for multivariate Gaussian Mixture Models (GMMs).
- `ml_kmeans()`, `ml_compute_cost()`, `ml_compute_silhouette_measure()` - Clustering with support for k-means.
- `ml_power_iteration()` - For clustering vertices of a graph given pairwise similarities as edge properties.

## Recommendation

- `ml_als()`, `ml_recommend()` - Recommendation using Alternating Least Squares matrix factorization.

## Evaluation

- `ml_clustering_evaluator()` - Evaluator for clustering.
- `ml_evaluate()` - Compute performance metrics.
- `ml_binary_classification_evaluator()`, `ml_binary_classification_eval()`, `ml_classification_eval()` - A set of functions to calculate performance metrics for prediction models.

## Frequent pattern

- `ml_fpgrowth()`, `ml_association_rules()`, `ml_freq_itemsets()` - A parallel FP-growth algorithm to mine frequent itemsets.
- `ml_freq_seq_patterns()`, `ml_prefixspan()` - PrefixSpan algorithm for mining frequent sequential patterns.

## Stats

- `ml_summary()` - Extracts a metric from the summary object of a Spark ML model.
- `ml_corr()` - Compute correlation matrix.

## Feature

- `ml_chisquare_test(x, features, label)` - Pearson's independence test for every feature against the label.
- `ml_default_stop_words()` - Loads the default stop words for the given language.

## Utilities

- `ml_call_constructor()` - Identifies the associated sparklyr ML constructor for the JVM.
- `ml_model_data()` - Extracts data associated with a Spark ML model.
- `ml_standardize_formula()` - Generates a formula string from user inputs.
- `ml_uid()` - Extracts the UID of an ML object.
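To show how the modeling functions above fit together, here is a small sketch that splits data, fits a logistic regression, and scores the held-out rows. It assumes `sc` is an open connection; the formula, split proportions, and seed are arbitrary choices, while the functions (`sdf_random_split()`, `ml_logistic_regression()`, `ml_predict()`, `ml_evaluate()`) are from sparklyr.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

# split the data inside Spark (70/30 is an arbitrary choice for this sketch)
splits <- sdf_random_split(cars, training = 0.7, test = 0.3, seed = 42)

# model transmission type from fuel efficiency, weight, and horsepower
fit <- ml_logistic_regression(splits$training, am ~ mpg + wt + hp)

ml_predict(fit, splits$test)   # score the held-out rows in Spark
ml_evaluate(fit, splits$test)  # compute performance metrics on the test split
```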
# ML pipelines

Easily create formal Spark pipeline models using R.
Save the pipeline in native Scala; it will have **no dependencies** on R.

## Initialize and train

Supported in Databricks Connect v2.

- `ml_pipeline()` - Initializes a new Spark pipeline.
- `ml_fit()` - Trains the model and outputs a Spark pipeline model.

## Save and retrieve

Supported in Databricks Connect v2.

- `ml_save()` - Saves into a format that can be read by Scala and PySpark.
- `ml_read()` - Reads a Spark object into sparklyr.

![](images/sparklyr-save-retrieve.png){fig-alt="ml_pipeline() to ft_dplyr_transformer() to ft_bucketizer() to ml_linear_regression() to ml_fit() to ml_save()." fig-align="center" width="501"}

[spark.posit.co/guides/pipelines](https://spark.posit.co/guides/pipelines)
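A minimal pipeline sketch along the lines of the diagram above; it assumes `sc` is an open connection, and the dplyr step, bucket cut points, formula, and save path are all placeholders rather than part of the original cheatsheet.

```r
library(sparklyr)
library(dplyr)

cars <- copy_to(sc, mtcars, overwrite = TRUE)

pipeline <- ml_pipeline(sc) %>%
  # capture a dplyr step as a pipeline stage
  ft_dplyr_transformer(cars %>% mutate(wt_ton = wt / 2)) %>%
  # bin horsepower (cut points chosen for illustration)
  ft_bucketizer("hp", "hp_bucket", splits = c(0, 100, 200, 400)) %>%
  # build features/label columns from a formula, then add the model stage
  ft_r_formula(mpg ~ wt_ton + hp_bucket) %>%
  ml_linear_regression()

model <- ml_fit(pipeline, cars)                        # train the whole pipeline
ml_save(model, "cars_pipeline_model", overwrite = TRUE) # path is a placeholder
```

Because the fitted pipeline is saved in Spark's native format, it can be reloaded and scored without R, which is the point of the "no dependencies on R" note above.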
# Distributed R

Supported in Databricks Connect v2.

Run arbitrary R code at scale inside your cluster with `spark_apply()`.
Useful when you need functionality that is only available in R, and to solve "embarrassingly parallel" problems.

```r
spark_apply(
  x, f,
  columns = NULL, memory = TRUE, group_by = NULL,
  name = NULL, barrier = NULL, fetch_result_as_sdf = TRUE
)
```

```r
copy_to(sc, mtcars) %>%
  spark_apply(
    nrow, # R-only function
    group_by = "am",
    columns = "am double, x long"
  )
```

------------------------------------------------------------------------

CC BY SA Posit Software, PBC • [info@posit.co](mailto:info@posit.co) • [posit.co](https://posit.co)

Learn more at [spark.posit.co](https://spark.posit.co/) and [therinspark.com](https://therinspark.com/).

Updated: 2024-05.

```r
packageVersion("sparklyr")
```

```
[1] '1.8.6'
```

------------------------------------------------------------------------

In the updated execute result, the `supporting` array is now empty, and the `filters` entry still lists `rmarkdown/pagebreak.lua`.