diff --git a/README.md b/README.md
index e22b0a35..9989fe78 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,9 @@
-# RAPIDS Notebooks-Contrib
+# RAPIDS Community Contrib
---
## Table of Contents
* [Intro](#intro)
-* [Installation](#install)
-* [Exploring the Repo](#explore)
-
-Notebooks:
-* [Getting Started](#get_started)
-* [Intermideate](#middle)
-* [Advanced](#advanced)
-* [BLOGS](#blogs)
-* [Conference](#conference)
+* [Exploring the Repo](#exploring)
+* [Great places to get started](#get_started)
---
@@ -18,151 +11,165 @@ Notebooks:
Welcome to the community contributed notebooks repo! (formerly known as Notebooks-Extended)
-The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer, learn why, how, and when including RAPIDS in a data science pipeline makes sense, and contain community contributions of RAPIDS knowledge. The difference between this repo and the [Notebooks Repo](https://github.com/rapidsai/notebooks) are:
-1. These are vetted, community-contributed notebooks (includes RAPIDS team member contributions).
-1. These notebooks won't run on air gapped systems, which is one of our container requirements. Many RAPIDS notebooks use additional PyData ecosystem packages, and include code for downloading datasets, thus they require network connectivity. If running on a system with no network access, please download all the data that you plan to use ahead of time or simply use the [core notebooks repo](https://github.com/rapidsai/notebooks).
+The purpose of this collection is to introduce RAPIDS to new users by providing useful Jupyter notebooks as learning aids. These notebooks are direct community contributions from the RAPIDS team, our Ecosystem Partners, and RAPIDS users like you!
+### What do you mean by "Community Notebooks"?
-## Installation
+These notebooks are for the community. That means:
+1. YOU can contribute workflow examples, tips and tricks, or tutorials for others to use and share! [We ask that you follow our Testing and PR process.](#contributing)
+2. If your notebook is awesome, it can be featured.
-Please use the [BUILD.md](BUILD.md) to check the pre-requisite packages and installation steps.
+There are some additional Community Responsibilities, as the RAPIDS team isn't maintaining these notebooks:
+- If you write an awesome notebook, please try to keep it maintained. You'll be mentioned on issues filed against it.
+- If you find a problem, don't just file an issue - please attempt to fix it!
+- If a notebook has a problem and/or its last tested RAPIDS release version is a legacy release, it may be moved to the archives.
-## Contributing
+### RAPIDS Showcase Notebooks
+These notebooks are built by the RAPIDS team and will be maintained by them. When we remove a notebook from this set, it becomes community maintained until it reaches `the_archive`.
+
+### How to Contribute
Please see our [guide for contributing to notebooks-contrib](CONTRIBUTING.md).
Once you've followed our guide, please don't forget to [test your notebooks!](TESTING.md) before making a PR.
-## Exploring the Repo
+## Exploring the Repo
### Folders
- `getting_started_notebooks` - “how to start using RAPIDS”. Contains notebooks showing "hello worlds", getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
-- `intermediate_notebooks` - “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
-- `advanced_notebooks` - "how to master RAPIDS". Contains notebooks showing kernel customization and advanced end-to-end workflows.
-- `blog notebooks` - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities
-- `conference notebooks` - contains notebooks used in conferences, such as GTC
+- `community_tutorials_and_guides` - community contributed “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
+- `community_archive` - contains notebooks with known issues that have not been fixed in 45 days or more. Also contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities.
+- `the_archive` - contains older notebooks from community members as well as notebooks that the RAPIDS team no longer updates, but are useful to the community, such as [`archived_rapids_blog_notebooks`](community_relaunch/the_archive/archived_rapids_blog_notebooks), [`archived_rapids_event_notebooks`](the_archive/archived_rapids_event_notebooks), and [`competition_notebooks`](the_archive/archived_rapids_competition_notebooks)
- `data` - contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites.
-### Lists
-- `multimedia_links.md` is a [list of videos](multimedia_links.md) by RAPIDS or our community talking about or showing how to use RAPIDS. Feel free to contribute your videos and RAPIDS themed playlists as well!
-- `competition_notebooks.md` - contains archived notebooks that were used in competitions, such as Kaggle. Some of these notebooks were blogged about and can also be found in our `blog notebooks` folder.
-
-# Our Notebooks
-Below is a listing of the notebooks in this repository. Each row will tell you the notebook's
-- Location in **Folder**
-- Notebook Title and Direct Link in **Notebook Title**
-- Description in **Description**
-- Design is for a `Single GPU`(SG) or `Multiple GPUs`(MG) in **GPU** (don't worry, you can still run the multi-GPU notebooks with a single GPU)
-- Data can be found in **Dataset Used**
-
-
-## Getting Started Notebooks:
-
-| Folder | Notebook Title | Description | GPU | Dataset Used |
-|-----------|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|------|--------------|
-| basics | [Getting_Started_with_cuDF](getting_started_notebooks/basics/Getting_Started_with_cuDF.ipynb) | This notebook shows how to get started with GPU DataFrames (single GPU only) using cuDF in RAPIDS. | SG | Self Generated |
-| basics | [Dask_Hello_World](getting_started_notebooks/basics/Dask_Hello_World.ipynb) | This notebook shows how to quickly setup Dask and run a "Hello World" example. | MG | Self Generated |
-| basics | [Getting_Started_with_Dask](getting_started_notebooks/basics/Getting_Started_with_Dask.ipynb) | This notebook shows how to get started with multi-GPU DataFrames using Dask and cuDF in RAPIDS. | MG | Self Generated |
-| basics | [hello_streamz](getting_started_notebooks/basics/hello_streamz.ipynb) | This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API. | SG | Self Generated |
-|basics -> blazingsql| [Getting Started with BlazingSQL](getting_started_notebooks/basics/blazingsql/getting_started_with_blazingsql.ipynb) | How to set up and get started with BlazingSQL and the RAPIDS AI suite. | SG | [Music Dataset](https://github.com/BlazingDB/bsql-demos/blob/master/data/Music.csv) |
-|basics -> blazingsql| [Federated Query Demo](getting_started_notebooks/basics/blazingsql/federated_query_demo.ipynb) | In a single query, join an Apache Parquet file, a CSV file, and a GPU DataFrame (GDF) in GPU memory. | SG | [Breast Cancer Diagnostic](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) |
-| intro_tutorials | [01_Introduction_to_RAPIDS](getting_started_notebooks/intro_tutorials/01_Introduction_to_RAPIDS.ipynb) | This notebook shows at a high level what each of the packages in RAPIDS are as well as what they do. | MG | Self Generated |
-| intro_tutorials | [02_Introduction_to_cuDF](getting_started_notebooks/intro_tutorials/02_Introduction_to_cuDF.ipynb) | This notebook shows how to work with cuDF DataFrames in RAPIDS. | SG | Self Generated |
-| intro_tutorials | [03_Introduction_to_Dask](getting_started_notebooks/intro_tutorials/03_Introduction_to_Dask.ipynb) | This notebook shows how to work with Dask using basic Python primitives like integers and strings. | MG | Self Generated |
-| intro_tutorials | [04_Introduction_to_Dask_using_cuDF_DataFrames](getting_started_notebooks/intro_tutorials/04_Introduction_to_Dask_using_cuDF_DataFrames.ipynb) | This notebook shows how to work with cuDF DataFrames using Dask. | MG | Self Generated |
-| intro_tutorials | [06_Introduction_to_Supervised_Learning](getting_started_notebooks/intro_tutorials/06_Introduction_to_Supervised_Learning.ipynb) | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. | SG | Self Generated |
-| intro_tutorials | [07_Introduction_to_XGBoost](getting_started_notebooks/intro_tutorials/07_Introduction_to_XGBoost.ipynb) | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. | SG | Self Generated |
-| intro_tutorials | [08_Introduction_to_Dask_XGBoost](getting_started_notebooks/intro_tutorials/08_Introduction_to_Dask_XGBoost.ipynb) | This notebook shows how to work with Dask XGBoost in RAPIDS. | MG | Self Generated |
-| intro_tutorials | [09_Introduction_to_Dimensionality_Reduction](getting_started_notebooks/intro_tutorials/09_Introduction_to_Dimensionality_Reduction.ipynb) | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. | SG | Self Generated |
-| intro_tutorials | [10_Introduction_to_Clustering](getting_started_notebooks/intro_tutorials/10_Introduction_to_Clustering.ipynb) | This notebook shows how to do GPU accelerated Clustering in RAPIDS. | SG | Self Generated |
----
+### Additional Resources
+- [Visit our YouTube Channel](https://www.youtube.com/channel/UCsoi4wfweA3I5FsPgyQnnqw/featured?view_as=subscriber) or see the [list of videos](multimedia_links.md) by RAPIDS or our community. Feel free to contribute your videos and RAPIDS themed playlists as well!
+- [Visit our Blogs on Medium](https://medium.com/rapids-ai/)
+
+## Great places to get started
+
+### Topics
+Click each topic to expand
+
+ RAPIDS Libraries Basics
+
+##### Getting Started Document
+* [Intro to RAPIDS](getting_started_materials/README.md)
+
+##### Teaching Notebooks
+* [Intro Notebooks to RAPIDS](getting_started_materials/intro_tutorials_and_guides) - covers cuDF, Dask, cuML and XGBoost.
+* [Learn RAPIDS Getting Started Tour (External)](https://github.com/RAPIDSAcademy/rapidsacademy/tree/master/tutorials/datasci/tour)
+* [Hello Worlds](getting_started_materials/hello_worlds)
+
+
+
+ Cloud Service Providers
+
+ * [AWS](https://rapids.ai/cloud#aws)
+ * [Single Instance](https://rapids.ai/cloud#AWS-EC2)
+ * [Multi GPU Dask](https://rapids.ai/cloud#AWS-Dask)
+ * [Kubernetes](https://rapids.ai/cloud#AWS-Kubernetes)
+ * [Sagemaker](https://rapids.ai/cloud#AWS-Sagemaker)
+ * [Video- Tutorial of RAPIDS on AWS Sagemaker](https://www.youtube.com/watch?v=BtE4d0v6Css)
+ * [Azure](https://rapids.ai/cloud#azure)
+ * [Single Instance](https://rapids.ai/cloud#AZ-single)
+ * [Multi GPU Dask](https://rapids.ai/cloud#AZ-Dask)
+ * [Kubernetes](https://rapids.ai/cloud#AZ-Kubernetes)
+ * [AzureML Service](https://rapids.ai/cloud#AZ-ML)
+ * [Video- Tutorial of RAPIDS on AzureML](https://www.youtube.com/watch?v=aqTmVVFnEwI)
+ * [GCP](https://rapids.ai/cloud#googlecloud)
+ * [Single Instance](https://rapids.ai/cloud#GC-single)
+ * [Multi GPU Dask (Dataproc)](https://rapids.ai/cloud#GC-Dask)
+ * [Kubernetes](https://rapids.ai/cloud#GC-Kubernetes)
+ * [CloudAI](https://rapids.ai/cloud#GC-AI)
+
+
+
+ Multi GPU
+
+* [Hello World to Dask](getting_started_materials/hello_worlds/Dask_Hello_World.ipynb)
+* [Intro to Dask](getting_started_materials/intro_tutorials_and_guides/03_Introduction_to_Dask.ipynb)
+* [Dask using cuDF](getting_started_materials/intro_tutorials_and_guides/04_Introduction_to_Dask_using_cuDF_DataFrames.ipynb)
+* [Learn RAPIDS Multi GPU Mini Tour (External)](https://github.com/RAPIDSAcademy/rapidsacademy/tree/master/tutorials/multigpu/minitour)
+* NYC taxi on Dataproc
+* [Weather Analysis](community_tutorials_and_guides/intermediate_notebooks/examples/weather.ipynb)
+
+
+
+ Streaming Data
+
+* [Chinmay Chandak's cuStreamz Gists (External)](https://gist.github.com/chinmaychandak)
+* [Using cuStreamz to Accelerate your Kafka Datasource (Blog)](https://medium.com/rapids-ai/the-custreamz-series-the-accelerated-kafka-datasource-4faf0baeb3f6)
+* [GPU accelerated Stream processing with RAPIDS (Blog)](https://medium.com/rapids-ai/gpu-accelerated-stream-processing-with-rapids-f2b725696a61)
+* [Hello World Streaming Data](getting_started_materials/hello_worlds/hello_streamz.ipynb)
+
+
+
+ NLP
+
+* [NLP with Hashing Vectorizer (Blog)](https://medium.com/rapids-ai/gpu-text-processing-now-even-simpler-and-faster-bde7e42c8c8a)
+* [Show me the Word Count (Archives)](the_archive/archived_rapids_blog_notebooks/nlp/show_me_the_word_count_gutenberg)
+
+
+ Graph Analytics
+
+
+ GIS/Spatial Analytics
-## Intermediate Notebooks:
-| Folder | Notebook Title | Description | GPU | Dataset Used |
-|-----------|------------------------|-------------------------------------------------------------|------|--------------|
-| examples | [linear_regression_demo.ipynb](intermediate_notebooks/examples/linear_regression_demo.ipynb) | This notebook demos how to implement simple and multiple linear regression with cuML to predict median housing price on sklearn's Boston Housing dataset. With corresponding [Medium Story](http://bit.ly/cuml_lin_reg_friend). | SG | [SKLearn Boston Housing](https://scikit-learn.org/stable/datasets/index.html#boston-dataset)|
-| examples | [umap_demo_full](intermediate_notebooks/examples/umap_demo_full.ipynb) | In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS. | SG | [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist)|
-| examples | [rf_demo](intermediate_notebooks/examples/rf_demo.ipynb) | Demonstration of using both cuml and sklearn to train a RandomForestClassifier on the Higgs dataset. | SG | [Higgs Boson](https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz)
-| examples | [weather](intermediate_notebooks/examples/weather.ipynb) | Demonstration of using Dask and cuDF to process and analyze weather history | MG | [NOAA Annual Weather Data](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/) |
-| examples -> blazingsql| [BlazingSQL vs Spark](intermediate_notebooks/examples/blazingsql/vs_pyspark_netflow.ipynb) | Analyze 73 million rows of net flow data. Compare BlazingSQL and Apache Spark timings for the same workload. | SG | [University of New South Wales LanL Dataset](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/) |
-| examples -> blazingsql| [Taxi Fare Prediction](intermediate_notebooks/examples/blazingsql/taxi_fare_prediction.ipynb) | Build & test a cuML Linear Regression model to predict the cost of a ride from 20 million rows of NYC Taxi data. | SG | [NYC Taxi Dataset](https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv) |
-| examples -> custreamz | [parsing_haproxy_logs](intermediate_notebooks/examples/custreamz/parsing_haproxy_logs.ipynb) | This notebook builds upon the weblogs streaming notebook and demonstrates more advanced features for parsing HAProxy logs. | SG | Self Generated
-| examples -> cugraph | [MG Pagerank](intermediate_notebooks/examples/cugraph/multi_gpu_pagerank.ipynb) | Analyze a Twitter dataset (26GB on disk) with 41.7 million users with 1.47 billion social relations (edges) to find out the most influential profiles. | MG | [Twitter](https://s3.us-east-2.amazonaws.com/rapidsai-data/cugraph/benchmark/twitter-2010.csv.gz) |
-| E2E -> taxi | [NYCTaxi](intermediate_notebooks/E2E/taxi/NYCTaxi-E2E.ipynb) | Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost. **Please Note: requires Google Dataproc to run!** [Blog](https://medium.com/rapids-ai/scale-out-rapids-on-google-cloud-dataproc-8a873233258f) | MG | [Google Dataproc Hosted NYC Taxi Data](https://console.cloud.google.com/storage/browser/anaconda-public-data/nyc-taxi/csv/?pli=1) |
-| E2E -> synthetic_3D | [rapids_ml_workflow_demo](intermediate_notebooks/E2E/synthetic_3D/rapids_ml_workflow_demo.ipynb) | A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU]. | SG | SciKit-Learn's demo datasets |
-| E2E -> census | [census_education2income_demo](intermediate_notebooks/E2E/census/census_education2income_demo.ipynb) | In this notebook we use 50 years of census data to see how education affects income. | SG | [Custom IPUMS Data pull](https://rapidsai-data.s3.us-east-2.amazonaws.com/datasets/ipums_education2income_1970-2010.csv.gz)
-| E2E -> mortgage | [mortgage_e2e](intermediate_notebooks/E2E/mortgage/mortgage_e2e.ipynb) | This notebook demonstrates multi-GPU ETL and XGBoost for data preprocessing and training on 17 years of [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). | MG | [Mortgage Loan Data](https://docs.rapids.ai/datasets/mortgage-data)
-| benchmarks | [cuml_benchmarks](intermediate_notebooks/benchmarks/cuml_benchmarks.ipynb) | The purpose of this notebook is to extensively benchmark all of the single GPU cuML algorithms against their skLearn counterparts, while also providing the ability to find and verify upper bounds. **Note: Best on large memory GPUs** | SG | Self Generated |
-| benchmarks | [rapids_decomposition](intermediate_notebooks/benchmarks/rapids_decomposition.ipynb) | This notebook benchmarks and visualize RAPIDS decomposition methods against each other. You have the opportunity to self-compare it to CPU speeds and methods | SG | SciKit-Learn's demo datasets |
-| benchmarks -> cugraph_benchmarks | [louvain_benchmark](intermediate_notebooks/benchmarks/cugraph_benchmarks/louvain_benchmark.ipynb) | This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX. | SG | Sparse collection | SG | SciKit-Learn's demo datasets |
-| benchmarks -> cugraph_benchmarks | [pagerank_benchmark](intermediate_notebooks/benchmarks/cugraph_benchmarks/pagerank_benchmark.ipynb) | This notebook benchmarks performance improvement of running PageRank within cuGraph against NetworkX. | SG | Sparse collection |
-| benchmarks -> cugraph_benchmarks | [BFS benchmark](intermediate_notebooks/benchmarks/cugraph_benchmarks/bfs_benchmark.ipynb) | This notebook benchmarks performance improvement of running BFS within cuGraph against NetworkX. | SG | Sparse collection |
-| benchmarks -> cugraph_benchmarks | [SSSP_benchmark](intermediate_notebooks/benchmarks/cugraph_benchmarks/sssp_benchmark.ipynb) | This notebook benchmarks performance improvement of running SSSP within cuGraph against NetworkX. | SG | Sparse collection |
-| benchmarks -> cugraph_mg_hibench | [MG pagerank_benchmark](intermediate_notebooks/benchmarks/cugraph_mg_hibench/multi_gpu_pagerank.ipynb) | This notebook runs cuGraph's multi-GPU PageRank on a dataset of 300GB. It designed for DGX-2 machines. | MG | [HiBench](https://rapidsai-data.s3.us-east-2.amazonaws.com/cugraph/benchmark/hibench/HiBench_300GB.tar.gz) |
----
+* [Seismic Facies Analysis (External)](https://github.com/NVIDIA/energy-sdk/tree/master/rapids_seismic_facies)
+
+
+ Genomics
-## Advanced Notebooks:
-| Folder | Notebook Title | Description | GPU | Dataset Used |
-|-----------|------------------------|----------------------------------------------------------|------|--------------|
-| tutorials | [rapids_customized_kernels](advanced_notebooks/tutorials/rapids_customized_kernels.ipynb) | **Archive Only.** This notebook shows how create customized kernels using CUDA to make your workflow in RAPIDS even faster. | SG | Self Generated |
----
+ * [Video- GPU accelerated Single Cell Analytics](https://www.youtube.com/watch?v=nYneL_uif3Q)
+
+
+ Cybersecurity
+* [RAPIDS CLX](https://docs.rapids.ai/api/clx/stable/)
+ * [CLX API Docs](https://docs.rapids.ai/api/clx/stable/api.html)
+ * [10 Minutes to CLX](https://docs.rapids.ai/api/clx/stable/10min-clx.html)
+ * [Getting Started with CLX and Streamz](https://docs.rapids.ai/api/clx/stable/intro-clx-streamz.html)
+* [Learn RAPIDS Cyber Security mini Tour (External)](https://github.com/RAPIDSAcademy/rapidsacademy/tree/master/tutorials/security/tour)
+* [Cyber Blog Notebooks (Archives)](the_archive/archived_rapids_blog_notebooks/cyber)
-## Blog Notebooks:
-| Folder | Notebook Title | Description | GPU | Dataset Used |
-|-----------|------------------------|------------------------------------------------------------|------|--------------|
-| cyber | [flow_classification_rapids](blog_notebooks/cyber/flow_classification/flow_classification_rapids.ipynb) | **Archive Only.** The `cyber` folder contains the associated companion files for the blog [GPU Accelerated Cyber Log Parsing with RAPIDS](https://medium.com/rapids-ai/gpu-accelerated-cyber-log-parsing-with-rapids-10896f57eee9), by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost. Uses [run_raw_data_generator](blog_notebooks/cyber/raw_data_generator/run_raw_data_generator.py) | SG | [University of New South Wales LanL Dataset](https://iotanalytics.unsw.edu.au/) |
-| cyber | [lanl_network_mapping_using_rapids](blog_notebooks/cyber/network_mapping/lanl_network_mapping_using_rapids.ipynb) | **Archive Only.** The `cyber` folder contains the associated companion files for the blog [GPU Accelerated Cyber Log Parsing with RAPIDS](https://medium.com/rapids-ai/gpu-accelerated-cyber-log-parsing-with-rapids-10896f57eee9), by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw windows event logs using cudf and uses cuGraph's pagerank model to build a network graph. Uses [run_raw_data_generator](blog_notebooks/cyber/raw_data_generator/run_raw_data_generator.py) | SG | [University of New South Wales LanL Dataset](https://iotanalytics.unsw.edu.au/) |
-| databricks | [RAPIDS_PCA_demo_avro_read](blog_notebooks/databricks/RAPIDS_PCA_demo_avro_read.ipynb) | The `databricks` folder is the companion file repository to the blog [RAPIDS can now be accessed on Databricks Unified Analytics Platform](https://medium.com/rapids-ai/rapids-can-now-be-accessed-on-databricks-unified-analytics-platform-666e42284bd1) by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebooks purpose is to showcase RAPIDS on Databricks use their sample datasets and show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import. **This notebook is for illustrative purposes only! Do not expect this notebook to successfully run on its own- this notebook's code is replicates a workflow meant to run on a specific platform, `Databricks`** | SG | [RAPIDS Toy Data](https://s3.us-east-2.amazonaws.com/rapidsai-data/datasets/mortgage/mortgage.npy.gz)|
-| plasticc | [rapids_lsst_full_demo](blog_notebooks/plasticc/notebooks/rapids_lsst_full_demo.ipynb) | **Archive Only.** This notebook demos the full CPU and GPU implementation of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). [Updated notebooks found here](conference_notebooks/KDD_2019/plasticc/) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| plasticc | [rapids_lsst_gpu_only_demo](blog_notebooks/plasticc/notebooks/rapids_lsst_gpu_only_demo.ipynb) | **Archive Only.** This GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). [Updated notebooks found here](conference_notebooks/KDD_2019/plasticc/) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| santander | [cudf_tf_demo](blog_notebooks/santander/cudf_tf_demo.ipynb) | **Archive Only.** This financial industry facing notebook is the cudf-tensorflow approach from the RAPIDS.ai team for Santander Customer Transaction Prediction. Placed 17/8808. [Blog](https://medium.com/rapids-ai/financial-data-modeling-with-rapids-5bca466f348) | SG | [Kaggle Santander Customer Transaction Prediction Dataset]( https://www.kaggle.com/c/santander-customer-transaction-prediction/data)
-| santander | [E2E_santander_pandas](blog_notebooks/santander/E2E_santander_pandas.ipynb) | **Archive Only.** This This financial data modelling notebook is the Pandas based version the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. Placed 17/8808. [Blog](https://medium.com/rapids-ai/financial-data-modeling-with-rapids-5bca466f348) | SG | [Kaggle Santander Customer Transaction Prediction Dataset]( https://www.kaggle.com/c/santander-customer-transaction-prediction/data)
-| santander | [E2E_santander](blog_notebooks/santander/E2E_santander.ipynb) | **Archive Only.** This financial data modelling notebook is the cuDF based version of the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. It allows you to compare cuDF performance to the Pandas version. Placed 17/8808. [Blog](https://medium.com/rapids-ai/financial-data-modeling-with-rapids-5bca466f348). | SG | [Kaggle Santander Customer Transaction Prediction Dataset]( https://www.kaggle.com/c/santander-customer-transaction-prediction/data)
-| regression | [regression_blog_notebook](blog_notebooks/regression/regression_blog_notebook.ipynb) | This is the companion notebook for the blog [Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series](https://medium.com/rapids-ai/essential-machine-learning-with-linear-models-in-rapids-part-1-of-a-series-992fab0240da) by Paul Mahler. It showcases an end to end notebook using the Bike Share dataset and cuML's implementation of ridge regression. | SG | [Bike Share Dataset]() |
-| regression | [regression_2_blog](blog_notebooks/regression/regression_2_blog.ipynb) | This is the companion notebook for the blog [Regression Blog 2: We’re Practically Giving These Regressions Away](https://medium.com/rapids-ai/regression-blog-2-were-practically-giving-these-regressions-away-932669f52d3b) by Paul Mahler. It showcases an end to end notebook using the Black Friday dataset and cuML's implementations of L1 and L2 regularizations using Ridge, Lasso, and ElasticNet regression techniques. | SG | [Analytics Vidhya Black Friday Hackathon Dataset](https://datahack.analyticsvidhya.com/contest/black-friday/) |
-| NLP | [show_me_the_word_count_gutenberg](blog_notebooks/nlp/show_me_the_word_count_gutenberg/show_me_the_word_count_gutenberg.ipynb) | This is the notebook for blog [Show Me The Word Count](https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801) by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases NLP pre-processing capabilties of nvstrings+cudf on the Gutenberg dataset. | SG | [Gutenburg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) |
-|cuspatial | [accelerate_geospatial_processing](blog_notebooks/cuspatial/trajectory_clustering.ipynb) | This is the notebook for blog [cuSpatial Accelerates Geospatial and Spatiotemporal Processing](https://medium.com/rapids-ai/releasing-cuspatial-to-accelerate-geospatial-and-spatiotemporal-processing-b686d8b32a9) by Milind Naphade, Jianting Zhang, Shuo Wang, Thomson Comer, Josh Paterson, Keith Kraus, Mark Harris, and Sujit Biswas. This notebook showcases cuSpatial benchmarking of directed Hausdorff distance for computing trajectory clustering on a large dataset. | SG | Trajectories Data and target_intersection.png |
-| randomforest | [fruits_rf_notebook](blog_notebooks/randomforest/fruits_rf_notebook.ipynb) | This is the notebook for blog [GPU-accelerated Random Forest]() by Vishal Mehta, Myrto Papadopoulou, Thejaswi Rao. This notebook showcases how to use GPU accelerated Random Forest Classification in cuML. The fruit dataset used is Self generated and used as an example in the [Blog](https://medium.com/rapids-ai/accelerating-random-forests-up-to-45x-using-cuml-dfb782a31bea) | SG | Self Generated
-| mortgage deep learning | [mortgage_e2e_deep_learning](blog_notebooks/mortgage_deep_learning/mortgage_e2e_deep_learning.ipynb) | **Archive Only.** This end to end notebook for the blog, [Using RAPIDS with PyTorch](https://medium.com/rapids-ai/using-rapids-with-pytorch-e602da018285), by Even Oldridge, combines the RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency. | MG | [Fannie Mae Mortgage Dataset](https://rapidsai.github.io/demos/datasets/mortgage-data)
-| svm | [svc_covertype](blog_notebooks/svm/svc_covertype.ipynb) | This notebook provides supplementary information for the Benchmark section of the [RAPIDS cuML SVC blog](https://nvda.ws/3c3Qy8H) post. | SG | [UCI Forest covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype)
----
+
+
+ Past Competitions
+
+- [RAPIDS.AI KGMON Competition Notebooks](the_archive/archived_competition_notebooks/kaggle) - contains a selection of notebooks that were used in Kaggle competitions.
+
-## Conference Notebooks:
-
-| Folder | Notebook Title | Description | GPU | Dataset Used |
-| ----------- | ------------------------ | --------------------------------------------------------------- | ---- | ------------ |
-| GTC_SJ_2019 | [GTC_tutorial_instructor](conference_notebooks/GTC_SJ_2019/GTC_tutorial_instructor.ipynb) | This is the instructor notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019. It contains all the demonstrated solutions. | SG | [Analytics Vidhya Black Friday Hackathon Dataset](https://datahack.analyticsvidhya.com/contest/black-friday/) |
-| GTC_SJ_2019 | [GTC_tutorial_student](conference_notebooks/GTC_SJ_2019/GTC_tutorial_student.ipynb) | This is the exercise-filled student notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019 | SG | [Analytics Vidhya Black Friday Hackathon Dataset](https://datahack.analyticsvidhya.com/contest/black-friday/) |
-| | | | | |
-| KDD_2019 | [Cybersecurity_KDD](conference_notebooks/KDD_2019/cyber/Cybersecurity_KDD.ipynb) | Using RAPIDS on network traffic and metadata, we demonstrate how to: 1. Triage and perform data exploration, 2. Model network data as a graph, 3. Perform graph analytics on the graph representation of the cyber network data, and 4. Prepare the results in a way that is suitable for visualization. | SG | [IDS 2018 dataset](https://www.unb.ca/cic/datasets/ids-2018.html) |
-| KDD_2019 | [MiningFrequentPatternsFromGraphs](conference_notebooks/KDD_2019/graph_pattern_mining/MiningFrequentPatternsFromGraphs.ipynb) | This notebook uses PC failure metadata, turns it into a coordinate list, and uses cugraph to find frequent patterns about the population that has failed | SG | [Microsoft PC Failure Metadata Graph](https://s3.us-east-2.amazonaws.com/rapidsai-data/datasets/fpm_graph/coo_fpm.csv.lzma) |
-| KDD_2019 | [Part 1.1 RNN Feature Engineering](conference_notebooks/KDD_2019/plasticc/Part_1-1_RNN_Feature_Engineering.ipynb) | Part 1.1 of this GPU only based notebook shows the RAPIDS speedup of the the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). - [Introduction found here.](conference_notebooks/KDD_2019/plasticc/Introduction.ipynb) - [Exercise Answers found here](conference_notebooks/KDD_2019/plasticc/Exercise_Answers.ipynb) - [Original submission found here](competition_notebooks/kaggle/plasticc/notebooks/rapids_lsst_gpu_only_demo.ipynb) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| KDD_2019 | [Part 1.2 RNN Extract Bottleneck](conference_notebooks/KDD_2019/plasticc/Part_1-2_RNN_Extract_Bottleneck.ipynb) | Part 1.2 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). - [Introduction found here.](conference_notebooks/KDD_2019/plasticc/Introduction.ipynb) - [Exercise Answers found here](conference_notebooks/KDD_2019/plasticc/Exercise_Answers.ipynb) - [Original submission found here](competition_notebooks/kaggle/plasticc/notebooks/rapids_lsst_gpu_only_demo.ipynb) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| KDD_2019 | [Part 2.1 Feature Engineering](contrib/conference_notebooks/KDD_2019/plasticc/Part_2-1_Feature_Engineering.ipynb) | Part 2.1 of this GPU only based notebook shows the RAPIDS speedup of the the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). - [Introduction found here.](conference_notebooks/KDD_2019/plasticc/Introduction.ipynb) - [Exercise Answers found here](conference_notebooks/KDD_2019/plasticc/Exercise_Answers.ipynb) - [Original submission found here](competition_notebooks/kaggle/plasticc/notebooks/rapids_lsst_gpu_only_demo.ipynb) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| KDD_2019 | [Part 2.2 Train XGBoost & MLP](conference_notebooks/KDD_2019/plasticc/Part_2-2_Train_XGBoost_&_MLP.ipynb) | Part 2.2 of this GPU only based notebook shows the RAPIDS speedup of the the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. [Blog](https://medium.com/rapids-ai/make-sense-of-the-universe-with-rapids-ai-d105b0e5ec95). - [Introduction found here.](conference_notebooks/KDD_2019/plasticc/Introduction.ipynb) - [Exercise Answers found here](conference_notebooks/KDD_2019/plasticc/Exercise_Answers.ipynb) - [Original submission found here](competition_notebooks/kaggle/plasticc/notebooks/rapids_lsst_gpu_only_demo.ipynb) | MG | [Kaggle PLAsTiCC-2018 dataset](https://www.kaggle.com/c/PLAsTiCC-2018/data) |
-| | | | | |
-| SCIPY_2019 | [SCIPY_2019 Tutorial Index](conference_notebooks/SCIPY_2019/index.ipynb) | This index outlines the "getting started" style tutorials within the folder. The tutorials cover cudf, cuml, and cugraph. These tutorials were presented at SCIPY 2019 | SG | Various Self Generated datasets and Zachary Karate Club Data Set |
-| | | | | |
-| ASONAM 2019 | [Cyber](conference_notebooks/ASONAM_2019/Cyber.ipynb) | Example notebook using RAPIDS to let an organization's security and forensics experts collect vast amounts of network traffic and network metadata and perform fast triage, processing, modeling, and visualization capabilities. | MG | [IDS 2018 dataset](https://www.unb.ca/cic/datasets/ids-2018.html) from the [Canadian Institute for Cybersecurity](https://www.unb.ca/cic/) |
-| ASONAM 2019 | [Spotify Playlist](conference_notebooks/ASONAM_2019/Spotify_Playlist.ipynb) | Shows how you can quickly use RAPIDS to explore the Spotify Million Playlist Dataset, which was created for the RecSys 2018 competition, and build a playlist recommender **Note: this dataset requires an independent user download and cannot be pulled from the notebook** | MG | RecSys 2018 competition |
-| ASONAM 2019 | [Weighted Link Prediction](conference_notebooks/ASONAM_2019/Weighted_Link_Prediction.ipynb) | This notebook uses cuGraph for Weighted Link Prediction to mitigate uncertainty on the Epinions Trust Network Dataset to predict the likelihood of trust or distrust between vertices. **Note: this dataset requires an independent user download and cannot be pulled from the notebook** | SG | Epinions Trust Network Dataset |
-| | | | | |
-| KDD 2020 | [KDD 2020](conference_notebooks/KDD_2020/README.md) | Conference material for the KDD 2020 hands-on tutorial | SG | |
-| KDD 2020 | [Taxi](conference_notebooks/KDD_2020/notebooks/Taxi/NYCTax.ipynb) | Analysis of the New York City Taxi dataset. Introductory notebook showing ETL, Statistical Analysis, Machine Learning, Graph, and Visualization | SG | 2016 New York Taxi Data |
-| KDD 2020 | [Tabular](conference_notebooks/KDD_2020/notebooks/nvtabular/rossmann-store-sales-example.ipynb) | Perform store sales prediction using tabular deep learning | SG | [ Kaggle Rossmann Store Sales competition](https://www.kaggle.com/c/rossmann-store-sales) |
-| KDD 2020 | [Cell RNA](conference_notebooks/KDD_2020/notebooks/Lungs/hlca_lung_gpu_analysis.ipynb) | Single-Cell RNA Sequencing Analysis | SG | human lung cells from [Travaglini et al. 2020](https://www.biorxiv.org/content/10.1101/742320v2) |
-| KDD 2020 | Parking | Analyzing Seattle Parking data and determining the best parking spot within a walkable distance from Space Needle | SG | |
-| KDD 2020 | CyBERT | Cyber Log Parsing using Neural Networks and Language Based Model | SG | |
+
+ Benchmarks
+
+* [MultiGPU PageRank Benchmark (Archived)](the_archive/archived_rapids_benchmarks/cugraph)
+* [RAPIDS Decomposition (Archived)](the_archive/archived_rapids_benchmarks/rapids_decomposition.ipynb)
+
+
+
+ Random Tips and Tricks
+
+* [Synthetic 3D End-to-End ML Workflow](community_tutorials_and_guides/synthetic)
+
+
+
+### How-Tos with our Ecosystem Partners
+
+- [BlazingSQL](#) - these notebooks supplement app.blazingsql.com and provide tutorials for local BlazingSQL workflows.
+- cuStreamz
+- [LearnRAPIDS](https://www.learnrapids.com/)
+- Graphistry
## Additional Information
* The `data` folder also includes the full image set from the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist).
-* `utils`: contains a set of useful scripts for interacting with RAPIDS Notebooks-Contrib
+* `utils`: contains a set of useful scripts for interacting with RAPIDS Community Notebooks
+
+* Our notebook examples and tutorials can be found on [GitHub](https://github.com/rapidsai), in each respective repo.
-* For our notebook examples and tutorials found in our standard containers, please see the [Notebooks Repo](https://github.com/rapidsai/notebooks)
diff --git a/intermediate_notebooks/examples/blazingsql/README.md b/community_tutorials_and_guides/blazingsql/README.md
similarity index 100%
rename from intermediate_notebooks/examples/blazingsql/README.md
rename to community_tutorials_and_guides/blazingsql/README.md
diff --git a/intermediate_notebooks/examples/blazingsql/taxi_fare_prediction.ipynb b/community_tutorials_and_guides/blazingsql/bsql_taxi_fare_prediction.ipynb
similarity index 87%
rename from intermediate_notebooks/examples/blazingsql/taxi_fare_prediction.ipynb
rename to community_tutorials_and_guides/blazingsql/bsql_taxi_fare_prediction.ipynb
index 483b325e..59b90e59 100644
--- a/intermediate_notebooks/examples/blazingsql/taxi_fare_prediction.ipynb
+++ b/community_tutorials_and_guides/blazingsql/bsql_taxi_fare_prediction.ipynb
@@ -21,7 +21,7 @@
"metadata": {},
"source": [
"#### BlazingSQL install check\n",
- "The next cell checks that you have BlazingSQL installed, and offers to install it if not (making sure the notebook will run as expected)."
+ "The next cell checks whether BlazingSQL is installed. If it is not, please first install RAPIDS and BlazingSQL via your preferred installation method (Docker or conda) from our [Release Selector](https://rapids.ai/start.html#rapids-release-selector)."
]
},
{
@@ -42,10 +42,10 @@
],
"source": [
"import sys \n",
- "# point import path notebooks-contrib/utils\n",
- "sys.path.append('../../../utils/')\n",
+ "# point import path notebooks-contrib/utils\n",
+ "sys.path.append('../../utils/')\n",
"from sql_check import bsql_start\n",
- "# check that BlazingSQL is installed\n",
+ "# check that BlazingSQL is installed\n",
"bsql_start()"
]
},
@@ -118,12 +118,12 @@
"import urllib.request\n",
"\n",
"# relative path to data folder\n",
- "data_dir = '../../../data/blazingsql/'\n",
+ "data_dir = '../../utils/blazingsql/'\n",
"# does folder exist?\n",
"if not os.path.exists(data_dir):\n",
" print('creating blazingsql directory')\n",
" # create folder\n",
- " os.system('mkdir ../../data/blazingsql')"
+ " os.system('mkdir ../../utils/blazingsql')"
]
},
{
@@ -135,13 +135,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv to ../../../data/blazingsql/taxi_00.csv\n",
- "Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv to ../../../data/blazingsql/taxi_01.csv\n",
- "Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv to ../../../data/blazingsql/taxi_02.csv\n",
- "Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv to ../../../data/blazingsql/taxi_03.csv\n"
+ "blazingsql __pycache__ sql_check.py\n",
+ "env-check.py rapids-colab.sh update_pyarrow.py\n"
]
}
],
+ "source": [
+ "!ls ../../utils/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
"source": [
"# download taxi data\n",
"base_url = 'https://blazingsql-colab.s3.amazonaws.com/taxi_data/'\n",
@@ -169,7 +176,7 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -277,7 +284,7 @@
"4 -74.012459 40.713932 1.0 "
]
},
- "execution_count": 6,
+ "execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -307,7 +314,22 @@
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# delete the dataframes that are no longer needed (gdf_00 through gdf_03)\n",
+ "del gdf_00\n",
+ "del gdf_01\n",
+ "del gdf_02\n",
+ "del gdf_03\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
"metadata": {},
"outputs": [
{
@@ -438,7 +460,7 @@
"max 3.537133e+03 2.080000e+02 "
]
},
- "execution_count": 7,
+ "execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -459,7 +481,7 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 10,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -470,25 +492,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 2 µs, sys: 1 µs, total: 3 µs\n",
+ "CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs\n",
"Wall time: 6.2 µs\n"
]
- },
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
}
],
"source": [
"%time\n",
"# make a table from the combined df\n",
- "bc.create_table('train_taxi', gdf, column_names=col_names)"
+ "bc.create_table('train_taxi', gdf)"
]
},
{
@@ -503,7 +515,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -615,7 +627,7 @@
"4 1.0 "
]
},
- "execution_count": 9,
+ "execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@@ -649,7 +661,7 @@
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -710,7 +722,7 @@
"4 10.5"
]
},
- "execution_count": 10,
+ "execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -735,7 +747,7 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -751,20 +763,20 @@
"output_type": "stream",
"text": [
"Coefficients:\n",
- "0 -0.027290\n",
- "1 0.003329\n",
- "2 0.106803\n",
- "3 0.637564\n",
+ "0 -0.027293\n",
+ "1 0.003330\n",
+ "2 0.106819\n",
+ "3 0.637570\n",
"4 0.000871\n",
"5 -0.000516\n",
- "6 0.092400\n",
+ "6 0.092438\n",
"dtype: float32\n",
"\n",
"Y intercept:\n",
- "3.3568549156188965\n",
+ "3.356637954711914\n",
"\n",
- "CPU times: user 689 ms, sys: 590 ms, total: 1.28 s\n",
- "Wall time: 1.28 s\n"
+ "CPU times: user 421 ms, sys: 120 ms, total: 541 ms\n",
+ "Wall time: 540 ms\n"
]
}
],
@@ -796,31 +808,31 @@
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "--2020-01-21 17:21:52-- https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv\n",
- "Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.120.105\n",
- "Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.120.105|:443... connected.\n",
+ "--2021-04-09 07:34:59-- https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv\n",
+ "Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.112.217\n",
+ "Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.112.217|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 982916 (960K) [text/csv]\n",
- "Saving to: ‘../../../data/blazingsql/test.csv’\n",
+ "Saving to: ‘../../data/blazingsql/test.csv.5’\n",
"\n",
- "test.csv 100%[===================>] 959.88K 2.56MB/s in 0.4s \n",
+ "test.csv.5 100%[===================>] 959.88K 2.64MB/s in 0.4s \n",
"\n",
- "2020-01-21 17:21:53 (2.56 MB/s) - ‘../../../data/blazingsql/test.csv’ saved [982916/982916]\n",
+ "2021-04-09 07:35:00 (2.64 MB/s) - ‘../../data/blazingsql/test.csv.5’ saved [982916/982916]\n",
"\n"
]
}
],
"source": [
"# do we have Test taxi file?\n",
- "if not os.path.isfile('../../../data/blazingsql/test.csv'):\n",
- " !wget -P ../../../data/blazingsql https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv"
+ "if not os.path.isfile('../../data/blazingsql/test.csv'):\n",
+ " !wget -P ../../data/blazingsql https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv"
]
},
{
@@ -832,16 +844,16 @@
},
{
"cell_type": "code",
- "execution_count": 13,
+ "execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "'/home/jupyter-winston/notebooks-contrib/data/blazingsql/test.csv'"
+ "'/rapids/notebooks-contrib//data/blazingsql/test.csv'"
]
},
- "execution_count": 13,
+ "execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -849,47 +861,36 @@
"source": [
"# identify path to this notebook, !pwd returns SList w/ path (str) at 0th index\n",
"path = !pwd\n",
- "# extract path notebooks-contrib then\n",
- "path = path[0].split('intermediate_notebooks')[0] \n",
+ "# extract path community_tutorials_and_guides/blazingsql then\n",
+ "path = path[0].split('community_tutorials_and_guides/blazingsql')[0] \n",
"# add path to data from there\n",
- "path = path + 'data/blazingsql/' + 'test.csv'\n",
+ "path = path + '/data/blazingsql/' + 'test.csv'\n",
"# how's it look?\n",
"path"
]
},
{
"cell_type": "code",
- "execution_count": 14,
+ "execution_count": 16,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yRM5PosNiuGh"
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
"# set column names and types\n",
"col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', \n",
" 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']\n",
"col_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']\n",
"\n",
- "# create test table directly from CSV\n",
- "bc.create_table('test_taxi', path, names=col_names, dtype=col_types)"
+ "# create test table directly from CSV - this doesn't make sense\n",
+ "bc.create_table('test_taxi', path, names=col_names, dtype=col_types)\n"
]
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 17,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -997,7 +998,7 @@
"4 1.0 "
]
},
- "execution_count": 15,
+ "execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@@ -1031,7 +1032,7 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 18,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -1041,71 +1042,21 @@
{
"data": {
"text/plain": [
- "0 12.854630\n",
- "1 12.854605\n",
- "2 11.256927\n",
- "3 11.811884\n",
- "4 11.811888\n",
- "5 11.811880\n",
- "6 11.222965\n",
- "7 11.222733\n",
- "8 11.222973\n",
- "9 12.239309\n",
- "10 12.239325\n",
- "11 12.239347\n",
- "12 9.696036\n",
- "13 9.696022\n",
- "14 11.468582\n",
- "15 11.468594\n",
- "16 11.460928\n",
- "17 11.460958\n",
- "18 11.460936\n",
- "19 11.460926\n",
- "20 13.485119\n",
- "21 12.707811\n",
- "22 12.707788\n",
- "23 12.707800\n",
- "24 12.707800\n",
- "25 12.707785\n",
- "26 12.707952\n",
- "27 12.707806\n",
- "28 12.707804\n",
- "29 12.707785\n",
+ "0 12.854544\n",
+ "1 12.854520\n",
+ "2 11.256961\n",
+ "3 11.811929\n",
+ "4 11.811933\n",
" ... \n",
- "9884 12.643631\n",
- "9885 12.643671\n",
- "9886 12.643652\n",
- "9887 12.643633\n",
- "9888 12.643650\n",
- "9889 12.643656\n",
- "9890 12.643648\n",
- "9891 12.643673\n",
- "9892 12.643652\n",
- "9893 12.643667\n",
- "9894 12.643648\n",
- "9895 12.643719\n",
- "9896 12.643631\n",
- "9897 13.454716\n",
- "9898 13.212105\n",
- "9899 14.138895\n",
- "9900 13.368757\n",
- "9901 13.635015\n",
- "9902 14.171509\n",
- "9903 13.832354\n",
- "9904 13.669437\n",
- "9905 13.259691\n",
- "9906 14.138172\n",
- "9907 13.452593\n",
- "9908 13.717201\n",
- "9909 13.714552\n",
- "9910 13.157532\n",
- "9911 13.419586\n",
- "9912 13.657433\n",
- "9913 13.259361\n",
+ "9909 13.714720\n",
+ "9910 13.157619\n",
+ "9911 13.419721\n",
+ "9912 13.657573\n",
+ "9913 13.259460\n",
"Length: 9914, dtype: float32"
]
},
- "execution_count": 16,
+ "execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@@ -1120,7 +1071,7 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 19,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -1168,7 +1119,7 @@
"      <td>-0.008110</td>\n",
"      <td>-0.019970</td>\n",
"      <td>1.0</td>\n",
- "      <td>12.854630</td>\n",
+ "      <td>12.854544</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
@@ -1179,7 +1130,7 @@
"      <td>-0.012024</td>\n",
"      <td>0.019814</td>\n",
"      <td>1.0</td>\n",
- "      <td>12.854605</td>\n",
+ "      <td>12.854520</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
@@ -1190,7 +1141,7 @@
"      <td>0.002869</td>\n",
"      <td>-0.005119</td>\n",
"      <td>1.0</td>\n",
- "      <td>11.256927</td>\n",
+ "      <td>11.256961</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
@@ -1201,7 +1152,7 @@
"      <td>-0.009277</td>\n",
"      <td>-0.016178</td>\n",
"      <td>1.0</td>\n",
- "      <td>11.811884</td>\n",
+ "      <td>11.811929</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
@@ -1212,7 +1163,7 @@
"      <td>-0.022537</td>\n",
"      <td>-0.045345</td>\n",
"      <td>1.0</td>\n",
- "      <td>11.811888</td>\n",
+ "      <td>11.811933</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
@@ -1227,14 +1178,14 @@
"4 21.0 1.0 12.0 12.0 -0.022537 -0.045345 \n",
"\n",
" passenger_count predicted_fare \n",
- "0 1.0 12.854630 \n",
- "1 1.0 12.854605 \n",
- "2 1.0 11.256927 \n",
- "3 1.0 11.811884 \n",
- "4 1.0 11.811888 "
+ "0 1.0 12.854544 \n",
+ "1 1.0 12.854520 \n",
+ "2 1.0 11.256961 \n",
+ "3 1.0 11.811929 \n",
+ "4 1.0 11.811933 "
]
},
- "execution_count": 17,
+ "execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -1246,6 +1197,20 @@
"# how's that look?\n",
"X_test.head()"
]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
}
],
"metadata": {
@@ -1270,7 +1235,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.7"
+ "version": "3.7.10"
}
},
"nbformat": 4,
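
The download cells in this notebook follow a check-then-download pattern: create the data folder if it is missing, then fetch a file only when it is not already present. Below is a minimal sketch of that pattern in which one path is shared by the existence check and the download. The helper name `ensure_file` and its `fetch` parameter are hypothetical and not part of the notebook; `fetch` defaults to `urllib.request.urlretrieve` and is injectable only so the logic can be exercised without network access.

```python
import os
import urllib.request


def ensure_file(data_dir, filename, base_url, fetch=urllib.request.urlretrieve):
    """Download base_url + filename into data_dir unless it is already there.

    Returns True if a download happened, False if the file already existed.
    `ensure_file` and `fetch` are illustrative names, not notebook APIs.
    """
    # create the folder if needed; exist_ok avoids an error when it already exists
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, filename)
    # the same path is used for the existence check and for the download
    if os.path.isfile(target):
        return False
    fetch(base_url + filename, target)
    return True
```

With the URL that appears in the notebook, a call might look like `ensure_file('../../data/blazingsql/', 'test.csv', 'https://blazingsql-demos.s3-us-west-1.amazonaws.com/')`. Because the check and the write target are built from the same `data_dir`, a second run should find the file and skip the download rather than saving a numbered copy.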
diff --git a/intermediate_notebooks/examples/blazingsql/vs_pyspark_netflow.ipynb b/community_tutorials_and_guides/blazingsql/bsql_vs_pyspark_netflow.ipynb
similarity index 60%
rename from intermediate_notebooks/examples/blazingsql/vs_pyspark_netflow.ipynb
rename to community_tutorials_and_guides/blazingsql/bsql_vs_pyspark_netflow.ipynb
index e50b7fb1..dd8a7760 100644
--- a/intermediate_notebooks/examples/blazingsql/vs_pyspark_netflow.ipynb
+++ b/community_tutorials_and_guides/blazingsql/bsql_vs_pyspark_netflow.ipynb
@@ -21,7 +21,7 @@
"metadata": {},
"source": [
"#### BlazingSQL install check\n",
- "The next cell checks that you have BlazingSQL installed, and offers to install it if not (making sure the notebook will run as expected)."
+ "The next cell checks whether BlazingSQL is installed. If it is not, please first install RAPIDS and BlazingSQL via your preferred installation method (Docker or conda) from our [Release Selector](https://rapids.ai/start.html#rapids-release-selector)."
]
},
{
@@ -43,7 +43,7 @@
"source": [
"import sys \n",
"# point import path notebooks-contrib/utils\n",
- "sys.path.append('../../../utils/')\n",
+ "sys.path.append('../../utils') \n",
"from sql_check import bsql_start\n",
"# check that BlazingSQL is installed\n",
"bsql_start()"
@@ -66,17 +66,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "--2020-01-21 17:27:12-- https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv\n",
- "Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.115.3\n",
- "Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.115.3|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 2725056295 (2.5G) [text/csv]\n",
- "Saving to: ‘../../../data/blazingsql/nf-chunk2.csv’\n",
- "\n",
- "nf-chunk2.csv 100%[===================>] 2.54G 43.5MB/s in 56s \n",
- "\n",
- "2020-01-21 17:28:14 (46.0 MB/s) - ‘../../../data/blazingsql/nf-chunk2.csv’ saved [2725056295/2725056295]\n",
- "\n"
+ "You've got the data!\n"
]
}
],
@@ -84,7 +74,7 @@
"import os\n",
"\n",
"# relative path to data folder\n",
- "data_dir = '../../../data/blazingsql/'\n",
+ "data_dir = '../../data/blazingsql/'\n",
"# file name\n",
"fn = 'nf-chunk2.csv'\n",
"\n",
@@ -97,7 +87,7 @@
"# do we have music file?\n",
"if not os.path.isfile(data_dir + fn):\n",
" # save nf-chunk2 to data folder, may take a few minutes to download (21,526,138 records)\n",
- " !wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv\n",
+ " !wget -P ../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv\n",
"else:\n",
" print(\"You've got the data!\")"
]
@@ -166,8 +156,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 3.75 s, sys: 1.19 s, total: 4.94 s\n",
- "Wall time: 4.93 s\n"
+ "CPU times: user 1.63 s, sys: 236 ms, total: 1.87 s\n",
+ "Wall time: 1.86 s\n"
]
}
],
@@ -194,19 +184,9 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 4.11 ms, sys: 23 µs, total: 4.13 ms\n",
- "Wall time: 3.32 ms\n"
+ "CPU times: user 1.36 s, sys: 43.5 ms, total: 1.4 s\n",
+ "Wall time: 545 ms\n"
]
- },
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
}
],
"source": [
@@ -232,8 +212,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 1.98 s, sys: 514 ms, total: 2.49 s\n",
- "Wall time: 1.95 s\n"
+ "CPU times: user 567 ms, sys: 52.4 ms, total: 619 ms\n",
+ "Wall time: 290 ms\n"
]
}
],
@@ -307,152 +287,152 @@
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
- "      <td>172.10.1.162</td>\n",
+ "      <td>172.10.1.33</td>\n",
"      <td>10.0.0.11</td>\n",
- "      <td>87</td>\n",
- "      <td>39628</td>\n",
- "      <td>53983</td>\n",
- "      <td>24</td>\n",
- "      <td>2013-04-03 06:50:13</td>\n",
- "      <td>2013-04-03 14:58:35</td>\n",
- "      <td>87</td>\n",
+ "      <td>110</td>\n",
+ "      <td>49886</td>\n",
+ "      <td>69630</td>\n",
+ "      <td>0</td>\n",
+ "      <td>2013-04-03 06:51:58</td>\n",
+ "      <td>2013-04-03 14:45:47</td>\n",
+ "      <td>110</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
- "      <td>172.30.2.60</td>\n",
- "      <td>10.0.0.9</td>\n",
- "      <td>82</td>\n",
- "      <td>34839</td>\n",
- "      <td>47716</td>\n",
- "      <td>134</td>\n",
- "      <td>2013-04-03 06:48:47</td>\n",
- "      <td>2013-04-03 12:12:37</td>\n",
- "      <td>82</td>\n",
+ "      <td>172.30.1.126</td>\n",
+ "      <td>239.255.255.250</td>\n",
+ "      <td>9</td>\n",
+ "      <td>2275</td>\n",
+ "      <td>0</td>\n",
+ "      <td>12</td>\n",
+ "      <td>2013-04-03 06:35:52</td>\n",
+ "      <td>2013-04-03 12:05:31</td>\n",
+ "      <td>9</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
- "      <td>172.30.1.56</td>\n",
- "      <td>172.0.0.1</td>\n",
- "      <td>25</td>\n",
- "      <td>3330</td>\n",
- "      <td>3240</td>\n",
- "      <td>67</td>\n",
- "      <td>2013-04-03 01:59:09</td>\n",
- "      <td>2013-04-03 22:05:39</td>\n",
- "      <td>25</td>\n",
+ "      <td>172.30.2.133</td>\n",
+ "      <td>239.255.255.250</td>\n",
+ "      <td>5</td>\n",
+ "      <td>1225</td>\n",
+ "      <td>0</td>\n",
+ "      <td>6</td>\n",
+ "      <td>2013-04-03 06:36:08</td>\n",
+ "      <td>2013-04-03 06:36:15</td>\n",
+ "      <td>5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
- "      <td>172.10.1.234</td>\n",
- "      <td>10.0.0.5</td>\n",
- "      <td>104</td>\n",
- "      <td>47287</td>\n",
- "      <td>64750</td>\n",
- "      <td>18</td>\n",
- "      <td>2013-04-03 06:53:55</td>\n",
- "      <td>2013-04-03 15:11:07</td>\n",
- "      <td>104</td>\n",
+ "      <td>172.30.1.149</td>\n",
+ "      <td>10.0.0.9</td>\n",
+ "      <td>78</td>\n",
+ "      <td>35309</td>\n",
+ "      <td>48556</td>\n",
+ "      <td>30</td>\n",
+ "      <td>2013-04-03 06:48:27</td>\n",
+ "      <td>2013-04-03 11:52:55</td>\n",
+ "      <td>78</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
- "      <td>10.1.0.76</td>\n",
- "      <td>172.10.1.82</td>\n",
+ "      <td>10.0.0.13</td>\n",
+ "      <td>172.10.1.81</td>\n",
"      <td>1</td>\n",
"      <td>633</td>\n",
"      <td>392</td>\n",
"      <td>0</td>\n",
- "      <td>2013-04-03 09:55:05</td>\n",
- "      <td>2013-04-03 09:55:05</td>\n",
+ "      <td>2013-04-03 09:48:26</td>\n",
+ "      <td>2013-04-03 09:48:26</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
- "      <td>172.30.1.85</td>\n",
- "      <td>10.0.0.8</td>\n",
- "      <td>84</td>\n",
- "      <td>37828</td>\n",
- "      <td>52864</td>\n",
+ "      <td>172.10.0.5</td>\n",
+ "      <td>10.247.58.129</td>\n",
+ "      <td>3</td>\n",
+ "      <td>1617</td>\n",
+ "      <td>108</td>\n",
+ "      <td>4</td>\n",
+ "      <td>2013-04-03 10:16:11</td>\n",
+ "      <td>2013-04-03 11:37:15</td>\n",
"      <td>3</td>\n",
- "      <td>2013-04-03 06:48:21</td>\n",
- "      <td>2013-04-03 12:06:53</td>\n",
- "      <td>84</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>6</th>\n",
- "      <td>172.30.1.10</td>\n",
- "      <td>10.0.0.12</td>\n",
- "      <td>69</td>\n",
- "      <td>31042</td>\n",
- "      <td>43044</td>\n",
- "      <td>25</td>\n",
- "      <td>2013-04-03 06:48:01</td>\n",
- "      <td>2013-04-03 12:11:40</td>\n",
- "      <td>69</td>\n",
+ "      <td>10.0.0.14</td>\n",
+ "      <td>172.10.2.143</td>\n",
+ "      <td>1</td>\n",
+ "      <td>571</td>\n",
+ "      <td>108</td>\n",
+ "      <td>0</td>\n",
+ "      <td>2013-04-03 10:13:57</td>\n",
+ "      <td>2013-04-03 10:13:57</td>\n",
+ "      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>7</th>\n",
- "      <td>172.30.1.201</td>\n",
- "      <td>172.0.0.1</td>\n",
- "      <td>29</td>\n",
- "      <td>2610</td>\n",
- "      <td>2610</td>\n",
- "      <td>0</td>\n",
- "      <td>2013-04-03 00:26:46</td>\n",
- "      <td>2013-04-03 23:06:00</td>\n",
- "      <td>29</td>\n",
+ "      <td>172.10.1.2</td>\n",
+ "      <td>10.0.0.10</td>\n",
+ "      <td>97</td>\n",
+ "      <td>44092</td>\n",
+ "      <td>61401</td>\n",
+ "      <td>2</td>\n",
+ "      <td>2013-04-03 06:48:54</td>\n",
+ "      <td>2013-04-03 15:05:37</td>\n",
+ "      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8</th>\n",
- "      <td>172.30.2.125</td>\n",
+ "      <td>172.10.1.212</td>\n",
"      <td>10.0.0.9</td>\n",
- "      <td>69</td>\n",
- "      <td>30701</td>\n",
- "      <td>41558</td>\n",
- "      <td>341</td>\n",
- "      <td>2013-04-03 06:50:50</td>\n",
- "      <td>2013-04-03 12:12:37</td>\n",
- "      <td>69</td>\n",
+ "      <td>102</td>\n",
+ "      <td>46260</td>\n",
+ "      <td>64410</td>\n",
+ "      <td>23</td>\n",
+ "      <td>2013-04-03 06:50:02</td>\n",
+ "      <td>2013-04-03 14:31:50</td>\n",
+ "      <td>102</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>9</th>\n",
- "      <td>172.10.1.89</td>\n",
- "      <td>10.0.0.5</td>\n",
- "      <td>112</td>\n",
- "      <td>51222</td>\n",
- "      <td>70260</td>\n",
- "      <td>24</td>\n",
- "      <td>2013-04-03 06:48:24</td>\n",
- "      <td>2013-04-03 15:17:39</td>\n",
- "      <td>112</td>\n",
+ "      <td>172.30.1.160</td>\n",
+ "      <td>10.0.0.12</td>\n",
+ "      <td>65</td>\n",
+ "      <td>29402</td>\n",
+ "      <td>40520</td>\n",
+ "      <td>16</td>\n",
+ "      <td>2013-04-03 06:55:18</td>\n",
+ "      <td>2013-04-03 11:52:13</td>\n",
+ "      <td>65</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
- " source destination targetPorts bytesOut bytesIn durationSeconds \\\n",
- "0 172.10.1.162 10.0.0.11 87 39628 53983 24 \n",
- "1 172.30.2.60 10.0.0.9 82 34839 47716 134 \n",
- "2 172.30.1.56 172.0.0.1 25 3330 3240 67 \n",
- "3 172.10.1.234 10.0.0.5 104 47287 64750 18 \n",
- "4 10.1.0.76 172.10.1.82 1 633 392 0 \n",
- "5 172.30.1.85 10.0.0.8 84 37828 52864 3 \n",
- "6 172.30.1.10 10.0.0.12 69 31042 43044 25 \n",
- "7 172.30.1.201 172.0.0.1 29 2610 2610 0 \n",
- "8 172.30.2.125 10.0.0.9 69 30701 41558 341 \n",
- "9 172.10.1.89 10.0.0.5 112 51222 70260 24 \n",
+ " source destination targetPorts bytesOut bytesIn \\\n",
+ "0 172.10.1.33 10.0.0.11 110 49886 69630 \n",
+ "1 172.30.1.126 239.255.255.250 9 2275 0 \n",
+ "2 172.30.2.133 239.255.255.250 5 1225 0 \n",
+ "3 172.30.1.149 10.0.0.9 78 35309 48556 \n",
+ "4 10.0.0.13 172.10.1.81 1 633 392 \n",
+ "5 172.10.0.5 10.247.58.129 3 1617 108 \n",
+ "6 10.0.0.14 172.10.2.143 1 571 108 \n",
+ "7 172.10.1.2 10.0.0.10 97 44092 61401 \n",
+ "8 172.10.1.212 10.0.0.9 102 46260 64410 \n",
+ "9 172.30.1.160 10.0.0.12 65 29402 40520 \n",
"\n",
- " firstFlowDate lastFlowDate attemptCount \n",
- "0 2013-04-03 06:50:13 2013-04-03 14:58:35 87 \n",
- "1 2013-04-03 06:48:47 2013-04-03 12:12:37 82 \n",
- "2 2013-04-03 01:59:09 2013-04-03 22:05:39 25 \n",
- "3 2013-04-03 06:53:55 2013-04-03 15:11:07 104 \n",
- "4 2013-04-03 09:55:05 2013-04-03 09:55:05 1 \n",
- "5 2013-04-03 06:48:21 2013-04-03 12:06:53 84 \n",
- "6 2013-04-03 06:48:01 2013-04-03 12:11:40 69 \n",
- "7 2013-04-03 00:26:46 2013-04-03 23:06:00 29 \n",
- "8 2013-04-03 06:50:50 2013-04-03 12:12:37 69 \n",
- "9 2013-04-03 06:48:24 2013-04-03 15:17:39 112 "
+ " durationSeconds firstFlowDate lastFlowDate attemptCount \n",
+ "0 0 2013-04-03 06:51:58 2013-04-03 14:45:47 110 \n",
+ "1 12 2013-04-03 06:35:52 2013-04-03 12:05:31 9 \n",
+ "2 6 2013-04-03 06:36:08 2013-04-03 06:36:15 5 \n",
+ "3 30 2013-04-03 06:48:27 2013-04-03 11:52:55 78 \n",
+ "4 0 2013-04-03 09:48:26 2013-04-03 09:48:26 1 \n",
+ "5 4 2013-04-03 10:16:11 2013-04-03 11:37:15 3 \n",
+ "6 0 2013-04-03 10:13:57 2013-04-03 10:13:57 1 \n",
+ "7 2 2013-04-03 06:48:54 2013-04-03 15:05:37 97 \n",
+ "8 23 2013-04-03 06:50:02 2013-04-03 14:31:50 102 \n",
+ "9 16 2013-04-03 06:55:18 2013-04-03 11:52:13 65 "
]
},
"execution_count": 7,
@@ -478,7 +458,7 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 8,
"metadata": {
"colab": {},
"colab_type": "code",
@@ -490,18 +470,17 @@
"output_type": "stream",
"text": [
"Collecting pyspark\n",
- "\u001b[?25l Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)\n",
- "\u001b[K |################################| 215.7MB 77.0MB/s eta 0:00:011\n",
- "\u001b[?25hCollecting py4j==0.10.7 (from pyspark)\n",
- "\u001b[?25l Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)\n",
- "\u001b[K |################################| 204kB 72.4MB/s eta 0:00:01\n",
+ " Using cached pyspark-3.1.1.tar.gz (212.3 MB)\n",
+ "Collecting py4j==0.10.9\n",
+ " Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)\n",
+ "\u001b[K |████████████████████████████████| 198 kB 27.8 MB/s eta 0:00:01\n",
"\u001b[?25hBuilding wheels for collected packages: pyspark\n",
" Building wheel for pyspark (setup.py) ... \u001b[?25ldone\n",
- "\u001b[?25h Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=7879a54a037a812709763c4abf7d3d85b5b9b9f8ac6278785942767dc8032f54\n",
- " Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471\n",
+ "\u001b[?25h Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=f924906e4f699df53b02459122288353d7e6790a0d5cb0181040406516c56b44\n",
+ " Stored in directory: /root/.cache/pip/wheels/43/47/42/bc413c760cf9d3f7b46ab7cd6590e8c47ebfd19a7386cd4a57\n",
"Successfully built pyspark\n",
"Installing collected packages: py4j, pyspark\n",
- "Successfully installed py4j-0.10.7 pyspark-2.4.4\n"
+ "Successfully installed py4j-0.10.9 pyspark-3.1.1\n"
]
}
],
@@ -523,7 +502,7 @@
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -538,8 +517,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 73.4 ms, sys: 27.2 ms, total: 101 ms\n",
- "Wall time: 6.59 s\n"
+ "CPU times: user 39.5 ms, sys: 39.9 ms, total: 79.4 ms\n",
+ "Wall time: 6.87 s\n"
]
}
],
@@ -573,7 +552,7 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -588,8 +567,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 20.4 ms, sys: 33.4 ms, total: 53.8 ms\n",
- "Wall time: 48.8 s\n"
+ "CPU times: user 27.5 ms, sys: 33.8 ms, total: 61.3 ms\n",
+ "Wall time: 40 s\n"
]
}
],
@@ -601,7 +580,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -616,8 +595,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 1.87 ms, sys: 0 ns, total: 1.87 ms\n",
- "Wall time: 28.6 ms\n"
+ "CPU times: user 1.21 ms, sys: 283 µs, total: 1.49 ms\n",
+ "Wall time: 155 ms\n"
]
}
],
@@ -629,7 +608,7 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -644,19 +623,19 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
- "| source| destination|targetPorts|bytesOut|bytesIn|durationSeconds| firstFlowDate| lastFlowDate|attemptCount|\n",
- "+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
- "| 172.10.1.13|239.255.255.250| 15| 2975| 0| 6|2013-04-03 06:36:19|2013-04-03 06:36:27| 15|\n",
- "|172.30.1.204|239.255.255.250| 8| 1750| 0| 6|2013-04-03 06:36:13|2013-04-03 06:36:20| 8|\n",
- "| 172.30.2.86| 172.0.0.1| 1| 540| 0| 2|2013-04-03 06:36:09|2013-04-03 06:36:09| 1|\n",
- "|172.30.1.246| 172.0.0.1| 29| 2610| 2610| 0|2013-04-03 00:26:46|2013-04-03 23:06:00| 29|\n",
- "| 172.30.1.51|239.255.255.250| 16| 3850| 0| 18|2013-04-03 06:35:22|2013-04-03 06:44:08| 16|\n",
- "+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
+ "+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
+ "| source| destination|targetPorts|bytesOut|bytesIn|durationSeconds| firstFlowDate| lastFlowDate|attemptCount|\n",
+ "+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
+ "|10.0.0.10| 172.20.1.73| 1| 571| 108| 0|2013-04-03 10:08:30|2013-04-03 10:08:30| 1|\n",
+ "|10.0.0.10|172.30.1.221| 1| 633| 392| 0|2013-04-03 10:10:39|2013-04-03 10:10:39| 1|\n",
+ "|10.0.0.10| 172.30.2.67| 1| 633| 392| 0|2013-04-03 10:43:48|2013-04-03 10:43:48| 1|\n",
+ "|10.0.0.11| 172.20.1.55| 1| 571| 108| 0|2013-04-03 10:11:52|2013-04-03 10:11:52| 1|\n",
+ "|10.0.0.11|172.30.1.245| 3| 1837| 892| 0|2013-04-03 09:45:12|2013-04-03 11:27:32| 3|\n",
+ "+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+\n",
"only showing top 5 rows\n",
"\n",
- "CPU times: user 12.1 ms, sys: 11.2 ms, total: 23.4 ms\n",
- "Wall time: 21.1 s\n"
+ "CPU times: user 33 ms, sys: 24.3 ms, total: 57.3 ms\n",
+ "Wall time: 38.8 s\n"
]
}
],
@@ -712,7 +691,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.7"
+ "version": "3.7.10"
}
},
"nbformat": 4,
diff --git a/community_tutorials_and_guides/census_education2income_demo.ipynb b/community_tutorials_and_guides/census_education2income_demo.ipynb
new file mode 100644
index 00000000..f724a6d0
--- /dev/null
+++ b/community_tutorials_and_guides/census_education2income_demo.ipynb
@@ -0,0 +1,1496 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Census Notebook\n",
+ "**Authorship** \n",
+ "Original Author: Taurean Dyer \n",
+ "Last Edit: Taurean Dyer, 9/26/2019 \n",
+ "\n",
+ "**Test System Specs** \n",
+ "Test System Hardware: GV100 \n",
+ "Test System Software: Ubuntu 18.04 \n",
+ "RAPIDS Version: 0.10.0a - Docker Install \n",
+ "Driver: 410.79 \n",
+ "CUDA: 10.0 \n",
+ "\n",
+ "\n",
+ "**Known Working Systems** \n",
+ "RAPIDS Versions: 0.8, 0.9, 0.10\n",
+ "\n",
+ "# Intro\n",
+ "Held every 10 years, the US census gives a detailed snapshot in time about the makeup of the country. The last census in 2010 surveyed nearly 309 million people. IPUMS.org provides researchers an open source data set with 1% to 10% of the census data set. In this notebook, we want to see how education affects total income earned in the US based on data from each census from 1970 to 2010, and see if we can predict some results if the census were held today, according to the national average. We will go through the ETL, training the model, and then testing the prediction. We'll make every effort to get as balanced a dataset as we can. We'll also pull some extra variables to allow for further self-exploration of gender-based education and income breakdowns. On a single Titan RTX, you can run the whole notebook workflow on the 4GB dataset of 14 million rows by 44 columns in less than 3 minutes."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Let's begin!**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import cuml\n",
+ "import cudf\n",
+ "import dask_cudf\n",
+ "import sys\n",
+ "import os\n",
+ "from pprint import pprint\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Get your data!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from dask.distributed import Client, wait\n",
+ "from dask_cuda import LocalCUDACluster\n",
+ "import dask, dask_cudf\n",
+ "from dask.diagnostics import ProgressBar\n",
+ "\n",
+ "# Use dask-cuda to start one worker per GPU on a single-node system\n",
+ "# When you shutdown this notebook kernel, the Dask cluster also shuts down.\n",
+ "cluster = LocalCUDACluster(ip='0.0.0.0')\n",
+ "client = Client(cluster)\n",
+ "# print client info\n",
+ "client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Ok, we've got a cluster of GPU workers. Notice also the link to the Dask status dashboard. It provides lots of useful information while running data processing tasks.\n",
+ "\n",
+ "## Accessing Data\n",
+ "\n",
+ "Now, let's download a dataset.\n",
+ "\n",
+ "If you're working on a local machine, you'd normally use wget, Python's `urllib` package, or another tool to pull down the data you want to analyze.\n",
+ "\n",
+ "To avoid making you wait for 200+ files to download, the cell below uses `urllib` to download just 20 years of weather records, plus a metadata file about the stations that recorded them. You can update the `years` list if you want to download more; that won't change the logic in the notebook, it'll just process more data.\n",
+ "\n",
+ "*Note*: The rest of the markdown commentary in this notebook assumes you're operating on all 232 years of data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Make and set a home for your data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import urllib.request\n",
+ "\n",
+ "data_dir = '../../data/weather/'\n",
+ "if not os.path.exists(data_dir):\n",
+ "    print('creating weather directory')\n",
+ "    os.makedirs(data_dir, exist_ok=True)  # also creates missing parent directories"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Choose and Download your data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# download weather observations\n",
+ "base_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/'\n",
+ "years = list(range(2000, 2020))\n",
+ "for year in years:\n",
+ "    fn = str(year) + '.csv.gz'\n",
+ "    if not os.path.isfile(data_dir+fn):\n",
+ "        print(f'Downloading {base_url+fn} to {data_dir+fn}')\n",
+ "        urllib.request.urlretrieve(base_url+fn, data_dir+fn)\n",
+ " \n",
+ "# download weather station metadata\n",
+ "station_meta_url = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'\n",
+ "if not os.path.isfile(data_dir+'ghcnd-stations.txt'):\n",
+ "    print('Downloading station meta..')\n",
+ "    urllib.request.urlretrieve(station_meta_url, data_dir+'ghcnd-stations.txt')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Alternatives to Pre-Downloading Data\n",
+ "\n",
+ "While downloading or copying data to your local environment is a good way to get started, many users will want other options:\n",
+ "\n",
+ "1. Reading directly from distributed storage, like HDFS\n",
+ "2. Reading from cloud storage (S3, GCS, ADLS, etc)\n",
+ "\n",
+ "See [Dask Remote Data Services](http://docs.dask.org/en/latest/remote-data-services.html) for more details on supported providers, authentication, and other storage configuration options.\n",
+ "\n",
+ "Here's an example of reading the same weather data, conveniently available in a public Amazon S3 bucket.\n",
+ "\n",
+ "But first make sure your Python environment has the right packages to read from your storage system of choice.\n",
+ "\n",
+ "For this example: ```conda install -y s3fs```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div><strong>Dask DataFrame Structure:</strong></div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>station_id</th><th>date</th><th>type</th><th>val</th></tr>\n",
+ "<tr><th>npartitions=1</th><th></th><th></th><th></th><th></th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th></th><td>object</td><td>int64</td><td>object</td><td>int64</td></tr>\n",
+ "<tr><th></th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "<div>Dask Name: read-csv, 1 tasks</div>"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# these CSV files don't have headers, we specify column names manually\n",
+ "names = [\"station_id\", \"date\", \"type\", \"val\"]\n",
+ "# there are more fields, but only the first 4 are relevant in this notebook\n",
+ "usecols = names[0:4]\n",
+ "\n",
+ "url = 's3://noaa-ghcn-pds/csv/1788.csv'\n",
+ "dask_cudf.read_csv(url, names=names, usecols=usecols, storage_options={'anon': True})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading Large & Multi-File DataSets\n",
+ "\n",
+ "Wait... there are many weather files: one for each year going back to the 1780s.\n",
+ "\n",
+ "Before RAPIDS 0.6, if you wanted to read all these files in, you'd need to either use a for-loop and manually concatenate dataframes, or use [`dask.delayed`](http://docs.dask.org/en/latest/delayed.html) functions that invoke `cudf.read_csv`.\n",
+ "\n",
+ "Fortunately, now there's `dask_cudf.read_csv`, which supports file globs, _and_ automatically splits files into chunks that can be processed serially when needed, so you're less likely to run out of memory.\n",
+ "\n",
+ "When you call `dask_cudf.read_csv`, Dask reads metadata for each CSV file and tasks workers with lists of filenames & byte-ranges that they're responsible for loading with cuDF's GPU CSV reader.\n",
+ "\n",
+ "*Note*: compressed files are not splittable on read, but you can [repartition](https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead) them downstream."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/csv.py:60: UserWarning: Warning gzip compression does not support breaking apart files\n",
+ "Please ensure that each individual file can fit in memory and\n",
+ "use the keyword ``chunksize=None to remove this message``\n",
+ "Setting ``chunksize=(size of file)``\n",
+ " \"Setting ``chunksize=(size of file)``\" % compression\n"
+ ]
+ }
+ ],
+ "source": [
+ "weather_ddf = dask_cudf.read_csv(data_dir+'*.csv.gz', names=names, usecols=usecols, compression='gzip')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Let's Process Some Data\n",
+ "\n",
+ "Per the [readme](https://docs.opendata.aws/noaa-ghcn-pds/readme.html) for this dataset, multiple types of weather observations are in the same files, and each carries a different unit of measure:\n",
+ "\n",
+ "| Observation Type | Existing Units | Action |\n",
+ "| ------------- | ------------- | ------------- |\n",
+ "| PRCP | Precipitation (tenths of mm) | convert to inches |\n",
+ "| SNWD | Snow depth (mm) | convert to inches |\n",
+ "| TMAX | tenths of degrees C | convert to Fahrenheit |\n",
+ "| TMIN | tenths of degrees C | convert to Fahrenheit |\n",
+ "\n",
+ "There are even more observation types, each with its own units of measure, but I won't list them all. In this notebook, I'm going to focus specifically on precipitation.\n",
+ "\n",
+ "The `type` column tells us what kind of weather observation each record represents. Ordinarily, you might use `query` to filter out subsets of records and apply different logic to each subset. However, [query doesn't support string datatypes yet](https://github.com/rapidsai/cudf/issues/111). Instead, you can use boolean indexing.\n",
+ "\n",
+ "For numeric types, Dask with cuDF works mostly like regular Dask. For instance, you can define new columns as combinations of other columns:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "precip_index = weather_ddf['type'] == 'PRCP'\n",
+ "precip_ddf = weather_ddf[precip_index]\n",
+ "\n",
+ "# convert 10ths of mm to inches\n",
+ "mm_to_inches = 0.0393701\n",
+ "precip_ddf['val'] = precip_ddf['val'] * 1/10 * mm_to_inches"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: Calling `.head()` will read the first few rows, usually from the first partition.\n",
+ "\n",
+ "In our case, the first partition represents weather data from 1788. Apparently, there wasn't _any_ precipitation data collected that year.\n",
+ "\n",
+ "In your own analyses, make sure you call `.head()` on a partition that you haven't already filtered everything out of!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>station_id</th><th>date</th><th>type</th><th>val</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>27</th><td>AGM00060355</td><td>20010101</td><td>PRCP</td><td>0.039370</td></tr>\n",
+ "<tr><th>30</th><td>AGM00060360</td><td>20010101</td><td>PRCP</td><td>0.118110</td></tr>\n",
+ "<tr><th>33</th><td>AGM00060402</td><td>20010101</td><td>PRCP</td><td>0.161417</td></tr>\n",
+ "<tr><th>37</th><td>AGM00060419</td><td>20010101</td><td>PRCP</td><td>0.078740</td></tr>\n",
+ "<tr><th>47</th><td>AGM00060445</td><td>20010101</td><td>PRCP</td><td>0.039370</td></tr>\n",
+ "</tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " station_id date type val\n",
+ "27 AGM00060355 20010101 PRCP 0.039370\n",
+ "30 AGM00060360 20010101 PRCP 0.118110\n",
+ "33 AGM00060402 20010101 PRCP 0.161417\n",
+ "37 AGM00060419 20010101 PRCP 0.078740\n",
+ "47 AGM00060445 20010101 PRCP 0.039370"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "precip_ddf.get_partition(1).head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Ok, we have a lot of weather observations. Now what?\n",
+ "\n",
+ "# Answering Questions With Data ##\n",
+ "\n",
+ "For some reason, residents of particular cities like to lay claim to having the best, or the worst of something. For Los Angeles, it's having the worst traffic. New Yorkers and Chicagoans argue over who has the best pizza. [West Coasters argue about who has the most rain](https://twitter.com/MikeNiccoABC7/status/1105184947663396864).\n",
+ "\n",
+ "Well... as a longtime Atlanta resident suffering from humidity exhaustion, I like to joke that with all the spring showers, _Atlanta_ is the new Seattle.\n",
+ "\n",
+ "Does my theory hold water? Or will the data rain on my bad pun parade?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# How Can I Test My Theory?\n",
+ "\n",
+ "We've already created `precip_ddf`, which contains only the precipitation observations, but it covers all 100k weather stations, most of them nowhere near Atlanta. This is also time-series data, so we'll need to aggregate over time ranges.\n",
+ "\n",
+ "To get down to just Atlanta and Seattle precipitation records, we have to...\n",
+ "\n",
+ "1. Extract year, month, and day from the compound \"date\" column, so that we can compare total rainfall across time.\n",
+ "\n",
+ "2. Load up the station metadata file.\n",
+ "\n",
+ "3. There's no \"city\" in the station metadata, so we'll do some geo-math and keep only stations near Atlanta and Seattle.\n",
+ "\n",
+ "4. Use a Groupby to compare changing precipitation patterns across time\n",
+ "\n",
+ "5. Use inner joins to filter the precipitation dataframe down to just Atlanta & Seattle data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Extracting Finer Grained Date Fields\n",
+ "\n",
+ "We _can_ do a bit of math to separate the date parts:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>station_id</th><th>date</th><th>type</th><th>val</th><th>year</th><th>month</th><th>day</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>27</th><td>AGM00060355</td><td>20010101</td><td>PRCP</td><td>0.039370</td><td>2001</td><td>1</td><td>1</td></tr>\n",
+ "<tr><th>30</th><td>AGM00060360</td><td>20010101</td><td>PRCP</td><td>0.118110</td><td>2001</td><td>1</td><td>1</td></tr>\n",
+ "<tr><th>33</th><td>AGM00060402</td><td>20010101</td><td>PRCP</td><td>0.161417</td><td>2001</td><td>1</td><td>1</td></tr>\n",
+ "<tr><th>37</th><td>AGM00060419</td><td>20010101</td><td>PRCP</td><td>0.078740</td><td>2001</td><td>1</td><td>1</td></tr>\n",
+ "<tr><th>47</th><td>AGM00060445</td><td>20010101</td><td>PRCP</td><td>0.039370</td><td>2001</td><td>1</td><td>1</td></tr>\n",
+ "</tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " station_id date type val year month day\n",
+ "27 AGM00060355 20010101 PRCP 0.039370 2001 1 1\n",
+ "30 AGM00060360 20010101 PRCP 0.118110 2001 1 1\n",
+ "33 AGM00060402 20010101 PRCP 0.161417 2001 1 1\n",
+ "37 AGM00060419 20010101 PRCP 0.078740 2001 1 1\n",
+ "47 AGM00060445 20010101 PRCP 0.039370 2001 1 1"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "precip_ddf['year'] = precip_ddf['date']/10000\n",
+ "precip_ddf['year'] = precip_ddf['year'].astype('int')\n",
+ "\n",
+ "precip_ddf['month'] = (precip_ddf['date'] - precip_ddf['year']*10000)/100\n",
+ "precip_ddf['month'] = precip_ddf['month'].astype('int')\n",
+ "\n",
+ "precip_ddf['day'] = (precip_ddf['date'] - precip_ddf['year']*10000 - precip_ddf['month']*100)\n",
+ "precip_ddf['day'] = precip_ddf['day'].astype('int')\n",
+ "\n",
+ "precip_ddf.get_partition(1).head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For this dataset, getting date parts is easier with string slicing. However, as is sometimes the case, Dask expects some aspect of cuDF's Python API to match Pandas in a way that [isn't fully compatible yet](https://github.com/rapidsai/cudf/issues/2367).\n",
+ "\n",
+ "That bug will likely be resolved quickly, but this example is a good chance to show how to work around similar problems.\n",
+ "\n",
+ "Dask has a [map_partitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.map_partitions) function which will apply a given Python function to all partitions of a distributed DataFrame. When you do this on a `dask_cudf` DataFrame, your input is a cuDF object:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " station_id date type val year month day\n",
+ "0 cat 0 cat 0.0 0 0 0\n",
+ "1 dog 1 dog 1.0 1 1 1\n",
+ "0 0\n",
+ "1 1\n",
+ "Name: date, dtype: int64\n",
+ "0 0\n",
+ "1 1\n",
+ "Name: date, dtype: object\n",
+ "0 0\n",
+ "1 1\n",
+ "Name: date, dtype: int64\n"
+ ]
+ },
+ {
+ "ename": "ValueError",
+ "evalue": "Metadata inference failed in `get_date_parts`.\n\nYou have supplied a custom function and Dask is unable to \ndetermine the type of output that that function returns. \n\nTo resolve this please provide a meta= keyword.\nThe docstring of the Dask function you ran should have more information.\n\nOriginal error is below:\n------------------------\nValueError('Could not convert strings to integer type due to presence of non-integer values.')\n\nTraceback:\n---------\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/utils.py\", line 180, in raise_on_meta_error\n yield\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\", line 5316, in _emulate\n return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))\n File \"\", line 8, in get_date_parts\n df['month'] = date_str.str.slice(4, 6).astype('int')\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\", line 2190, in astype\n raise e\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\", line 2182, in astype\n data = self._column.astype(dtype)\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/column.py\", line 1009, in astype\n return self.as_numerical_column(dtype)\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py\", line 4825, in as_numerical_column\n \"Could not convert strings to integer \"\n",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/utils.py\u001b[0m in \u001b[0;36mraise_on_meta_error\u001b[0;34m(funcname, udf)\u001b[0m\n\u001b[1;32m 179\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 180\u001b[0;31m \u001b[0;32myield\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 181\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\u001b[0m in \u001b[0;36m_emulate\u001b[0;34m(func, *args, **kwargs)\u001b[0m\n\u001b[1;32m 5315\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mraise_on_meta_error\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfuncname\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"udf\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 5316\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0m_extract_meta\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0m_extract_meta\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5317\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36mget_date_parts\u001b[0;34m(df)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdate_str\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mslice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'int'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'month'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdate_str\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mslice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'int'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 9\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdate_str\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mslice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'int'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\u001b[0m in \u001b[0;36mastype\u001b[0;34m(self, dtype, copy, errors)\u001b[0m\n\u001b[1;32m 2189\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"raise\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2190\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2191\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"warn\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\u001b[0m in \u001b[0;36mastype\u001b[0;34m(self, dtype, copy, errors)\u001b[0m\n\u001b[1;32m 2181\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2182\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2183\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/column.py\u001b[0m in \u001b[0;36mastype\u001b[0;34m(self, dtype, **kwargs)\u001b[0m\n\u001b[1;32m 1008\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1009\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mas_numerical_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1010\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py\u001b[0m in \u001b[0;36mas_numerical_column\u001b[0;34m(self, dtype)\u001b[0m\n\u001b[1;32m 4824\u001b[0m raise ValueError(\n\u001b[0;32m-> 4825\u001b[0;31m \u001b[0;34m\"Could not convert strings to integer \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4826\u001b[0m \u001b[0;34m\"type due to presence of non-integer values.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mValueError\u001b[0m: Could not convert strings to integer type due to presence of non-integer values.",
+ "\nThe above exception was the direct cause of the following exception:\n",
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m# any single-GPU function that works in cuDF may be called via dask.map_partitions\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mprecip_ddf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mprecip_ddf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmap_partitions\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mget_date_parts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mprecip_ddf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_partition\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\u001b[0m in \u001b[0;36mmap_partitions\u001b[0;34m(self, func, *args, **kwargs)\u001b[0m\n\u001b[1;32m 676\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mdivision\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 677\u001b[0m \"\"\"\n\u001b[0;32m--> 678\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mmap_partitions\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 679\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 680\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0minsert_meta_param_description\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpad\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m12\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\u001b[0m in \u001b[0;36mmap_partitions\u001b[0;34m(func, meta, enforce_metadata, transform_divisions, *args, **kwargs)\u001b[0m\n\u001b[1;32m 5367\u001b[0m \u001b[0;31m# Use non-normalized kwargs here, as we want the real values (not\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5368\u001b[0m \u001b[0;31m# delayed values)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 5369\u001b[0;31m \u001b[0mmeta\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_emulate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5370\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5371\u001b[0m \u001b[0mmeta\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmake_meta\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmeta\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmeta_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\u001b[0m in \u001b[0;36m_emulate\u001b[0;34m(func, *args, **kwargs)\u001b[0m\n\u001b[1;32m 5314\u001b[0m \"\"\"\n\u001b[1;32m 5315\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mraise_on_meta_error\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfuncname\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"udf\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 5316\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0m_extract_meta\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0m_extract_meta\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5317\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5318\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/contextlib.py\u001b[0m in \u001b[0;36m__exit__\u001b[0;34m(self, type, value, traceback)\u001b[0m\n\u001b[1;32m 128\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 130\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgen\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mthrow\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtraceback\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 131\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 132\u001b[0m \u001b[0;31m# Suppress StopIteration *unless* it's the same exception that\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/utils.py\u001b[0m in \u001b[0;36mraise_on_meta_error\u001b[0;34m(funcname, udf)\u001b[0m\n\u001b[1;32m 199\u001b[0m )\n\u001b[1;32m 200\u001b[0m \u001b[0mmsg\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmsg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\" in `{0}`\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfuncname\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfuncname\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m\"\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrepr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 201\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 202\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 203\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mValueError\u001b[0m: Metadata inference failed in `get_date_parts`.\n\nYou have supplied a custom function and Dask is unable to \ndetermine the type of output that that function returns. \n\nTo resolve this please provide a meta= keyword.\nThe docstring of the Dask function you ran should have more information.\n\nOriginal error is below:\n------------------------\nValueError('Could not convert strings to integer type due to presence of non-integer values.')\n\nTraceback:\n---------\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/utils.py\", line 180, in raise_on_meta_error\n yield\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py\", line 5316, in _emulate\n return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))\n File \"\", line 8, in get_date_parts\n df['month'] = date_str.str.slice(4, 6).astype('int')\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\", line 2190, in astype\n raise e\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py\", line 2182, in astype\n data = self._column.astype(dtype)\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/column.py\", line 1009, in astype\n return self.as_numerical_column(dtype)\n File \"/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py\", line 4825, in as_numerical_column\n \"Could not convert strings to integer \"\n"
+ ]
+ }
+ ],
+ "source": [
+ "def get_date_parts(df):\n",
+ "    print(df.head(10))\n",
+ "    print(df[\"date\"])\n",
+ "    date_str = df['date'].astype('str')\n",
+ "    print(date_str)\n",
+ "    df['year'] = date_str.str.slice(0, 4).astype('int')\n",
+ "    print(date_str.str.slice(0, 4).astype('int'))\n",
+ "    df['month'] = date_str.str.slice(4, 6).astype('int')\n",
+ "    print(date_str.str.slice(4, 6).astype('int'))\n",
+ "    df['day'] = date_str.str.slice(6, 8).astype('int')\n",
+ "    return df\n",
+ "# any single-GPU function that works in cuDF may be called via dask.map_partitions\n",
+ "precip_ddf = precip_ddf.map_partitions(get_date_parts)\n",
+ "precip_ddf.get_partition(1).head()\n",
+ "\n",
+ "# def get_date_parts(df):\n",
+ "# date_str = df['date'].astype('str')\n",
+ " \n",
+ "# df['year'] = date_str.str.slice(0, 4)\n",
+ "# print(df['year'])\n",
+ "# df['month'] = date_str.str.slice(4, 6)\n",
+ "# print(df['month'])\n",
+ "# df['day'] = date_str.str.slice(6, 8)\n",
+ "# print(df['month'])\n",
+ " \n",
+ "# df['year'] = date_str.str.slice(0, 4).astype('int')\n",
+ "# df['month'] = date_str.str.slice(4, 6).astype('int')\n",
+ "# df['day'] = date_str.str.slice(6, 8).astype('int')\n",
+ "# return df\n",
+ "\n",
+ "# any single-GPU function that works in cuDF may be called via dask.map_partitions\n",
+ "# precip_ddf = precip_ddf.map_partitions(get_date_parts)\n",
+ "# precip_ddf.get_partition(1).head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `map_partitions` pattern is also useful whenever there are cuDF-specific functions without a direct mapping into Dask."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Loading Station Metadata ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!head -n 5 /data/weather/ghcnd-stations.txt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Wait... That's no CSV file! It's fixed-width!\n",
+ "\n",
+ "That's annoying because we don't have a reader for it. We could use CPU code to pre-process the file, making it friendlier for loading into a DataFrame, but RAPIDS is about end-to-end data processing without leaving the GPU.\n",
+ "\n",
+ "This file is small enough that we can handle it directly with cuDF on a single GPU.\n",
+ "\n",
+ "*Warning*: Make sure you [create your dask-cuda cluster _before_ importing cudf](https://github.com/rapidsai/dask-cuda/issues/32).\n",
+ "\n",
+ "Here's how to clean up this metadata using cuDF and string operations:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cudf\n",
+ "\n",
+ "fn = data_dir+'ghcnd-stations.txt'\n",
+ "# There are no '|' chars in the file. Use that to read the file as a single column per line\n",
+ "# quoting=3 handles misplaced quotes in the `name` field \n",
+ "station_df = cudf.read_csv(fn, sep='|', quoting=3, names=['lines'], header=None)\n",
+ "\n",
+ "# you can use normal DataFrame .str accessor, and chain operators together\n",
+ "station_df['station_id'] = station_df['lines'].str.slice(0, 11).str.strip()\n",
+ "station_df['latitude'] = station_df['lines'].str.slice(12, 20).str.strip()\n",
+ "station_df['longitude'] = station_df['lines'].str.slice(21, 30).str.strip()\n",
+ "station_df = station_df.drop(columns=['lines'])\n",
+ "\n",
+ "station_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Managing Memory\n",
+ "\n",
+ "While GPU memory is very fast, there's less of it than host RAM. It's a good idea to avoid storing lots of columns that aren't useful for what you're trying to do, especially when they're strings.\n",
+ "\n",
+ "For example, for the station metadata, there are more columns than we parsed out above. In this workflow we only need `station_id`, `latitude`, and `longitude`, so we skipped parsing the rest of the columns.\n",
+ "\n",
+ "We also need to convert latitude and longitude from strings to floats, and convert the single-GPU DataFrame to a Dask DataFrame that can be distributed across workers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# you can cast string columns to numerics\n",
+ "station_df['latitude'] = station_df['latitude'].astype('float')\n",
+ "station_df['longitude'] = station_df['longitude'].astype('float')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Filtering Weather Stations by Distance\n",
+ "\n",
+ "Initially we planned to use our [existing Haversine Distance user defined function](https://medium.com/rapids-ai/user-defined-functions-in-rapids-cudf-2d7c3fc2728d) to figure out which stations are within a given distance from a city. However, that relies on a [numba CUDA JIT'ed kernel](https://numba.pydata.org/numba-doc/dev/cuda/index.html), which would be slower and would incur compilation time the first time you call it.\n",
+ "\n",
+ "Now that [cuSpatial](https://github.com/rapidsai/cuspatial) is available as [a nightly conda package](https://anaconda.org/rapidsai-nightly/cuspatial), we can use it without having to build from source:\n",
+ "\n",
+ "```\n",
+ "conda install -c conda-forge -c rapidsai-nightly cuspatial\n",
+ "```\n",
+ "\n",
+ "For this scenario, we've manually looked up Atlanta and Seattle's city centers and will fill `cudf.Series` with their latitude and longitude values. Then we can call a cuSpatial function to compute the distance between each station and each city."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cuspatial\n",
+ "\n",
+ "# fill new Series with Atlanta lat/lng\n",
+ "station_df['atlanta_lat'] = 33.7490\n",
+ "station_df['atlanta_lng'] = -84.3880\n",
+ "# compute distance from each station to Atlanta\n",
+ "station_df['atlanta_dist'] = cuspatial.haversine_distance(\n",
+ " station_df['longitude'], station_df['latitude'],\n",
+ " station_df['atlanta_lng'], station_df['atlanta_lat']\n",
+ ")\n",
+ "\n",
+ "# fill new Series with Seattle lat/lng\n",
+ "station_df['seattle_lat'] = 47.6219\n",
+ "station_df['seattle_lng'] = -122.3517\n",
+ "# compute distance from each station to Seattle\n",
+ "station_df['seattle_dist'] = cuspatial.haversine_distance(\n",
+ " station_df['longitude'], station_df['latitude'],\n",
+ " station_df['seattle_lng'], station_df['seattle_lat']\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Checking the Results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Inspect the results:\n",
+ "atlanta_stations_df = station_df.query('atlanta_dist <= 25')\n",
+ "seattle_stations_df = station_df.query('seattle_dist <= 25')\n",
+ "\n",
+ "print(f'Atlanta Stations: {len(atlanta_stations_df)}')\n",
+ "print(f'Seattle Stations: {len(seattle_stations_df)}')\n",
+ "\n",
+ "atlanta_stations_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "[Google tells me those station ids are from Smyrna](https://geographic.org/global_weather/georgia/smyrna_23_ne_002.html), a town just outside of Atlanta's perimeter. Our distance calculation worked!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Grouping & Aggregating by Time Range\n",
+ "\n",
+ "Before using an inner join to filter down to city-specific precipitation data, we can use a groupby to sum the precipitation by station and year. That'll allow the join to proceed faster and use less memory.\n",
+ "\n",
+ "One total precipitation record per station per year is relatively small, and we're going to need to graph this data, so we'll go ahead and `compute()` the result, asking Dask to aggregate across the 200+ years' worth of data and bring the results back to the client as a single-GPU cuDF DataFrame.\n",
+ "\n",
+ "Note that with Dask, data is partitioned and distributed across multiple workers. Some operations require that workers \"[shuffle](http://docs.dask.org/en/latest/dataframe-groupby.html#)\" data from their partitions back and forth across the network, which has major performance implications. Today, join, groupby, and sort operations can be fairly network-constrained.\n",
+ "\n",
+ "See the [slides](https://www.slideshare.net/MatthewRocklin/ucxpython-a-flexible-communication-library-for-python-applications) from a recent talk at GTC San Jose to learn more about [ongoing efforts to integrate Dask with UCX](https://github.com/rapidsai/ucx-py/) and allow it to use accelerated networking hardware like InfiniBand and [NVLink](https://www.nvidia.com/en-us/data-center/nvlink/).\n",
+ "\n",
+ "In the meantime, distributed operators that require shuffling like joins, groupbys, and sorts work, albeit not as fast as we'd like."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "precip_year_ddf = precip_ddf.groupby(by=['station_id', 'year']).val.sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that we're calling `compute` again here. This tells Dask to actually start computing the full set of processing logic defined thus far:\n",
+ "\n",
+ "1. Read and decompress 232 gzipped files (about 100 GB decompressed)\n",
+ "2. Send to the GPU and parse\n",
+ "3. Filter down to precipitation records\n",
+ "4. Apply a conversion to inches\n",
+ "5. Sum total inches of rain per year per each of the 108k weather stations\n",
+ "6. Combine and pull results into a single-GPU DataFrame on the client host\n",
+ "\n",
+ "In other words, this will take time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%time precip_year_df = precip_year_ddf.compute()\n",
+ "\n",
+ "# Convert from the groupby multi-indexed DataFrame back to a normal DF which we can use with merge\n",
+ "precip_year_df = precip_year_df.reset_index()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Using Inner Joins to Filter Weather Observations\n",
+ "\n",
+ "We have separate DataFrames containing Atlanta and Seattle stations, and we have our total precipitation grouped by `station_id` and `year`. Computing inner joins can let us compute total precipitation by year for just Atlanta and Seattle."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%time atlanta_precip_df = precip_year_df.merge(atlanta_stations_df, on=['station_id'], how='inner')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "atlanta_precip_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%time seattle_precip_df = precip_year_df.merge(seattle_stations_df, on=['station_id'], how='inner')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "seattle_precip_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lastly, we need to normalize the total amount of rain in each city by the number of stations which collected rainfall: Seattle had twice as many stations collecting, but that doesn't mean more total rain fell! "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "atlanta_rain = atlanta_precip_df.groupby(['year']).val.sum()/len(atlanta_stations_df)\n",
+ "atlanta_rain.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "seattle_rain = seattle_precip_df.groupby(['year']).val.sum()/len(seattle_stations_df)\n",
+ "\n",
+ "seattle_rain.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Visualizing the Answer\n",
+ "\n",
+ "To generate the graphs in the cells below, first you'll need to ```conda install -y python-graphviz matplotlib```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%matplotlib inline\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "plt.close('all')\n",
+ "plt.rcParams['figure.figsize'] = [20, 10]\n",
+ "\n",
+ "# draw both cities' yearly rainfall on one shared set of axes\n",
+ "fig, ax = plt.subplots()\n",
+ "\n",
+ "atlanta_rain.to_pandas().plot(ax=ax)\n",
+ "seattle_rain.to_pandas().plot(ax=ax)\n",
+ "\n",
+ "ax.legend(['Atlanta', 'Seattle'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Results\n",
+ "\n",
+ "It looks like I'm right (mostly)! At least for roughly the last 80 years, it rains more by volume in Atlanta than it does in Seattle. The data seems to confirm my suspicions.\n",
+ "\n",
+ "But as usual the answer raises additional questions:\n",
+ "\n",
+ "1. Without singling out Atlanta and Seattle, which city actually has the most precipitation by volume?\n",
+ "\n",
+ "2. Why is there such a large increase in observed precipitation in the last 10 years?\n",
+ "\n",
+ "3. One friend noted that it rains more frequently in Seattle, just not as hard. A contrarian was quick to point out that it mists a lot in Seattle. How often is it just \"misty\", but not really raining?\n",
+ "\n",
+ "We'll revisit these questions in a future post, and look forward to seeing what kinds of analyses YOU come up with."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Takeaways\n",
+ "\n",
+ "We just showed some of the ways you can use Dask and cuDF to parallelize typical data processing tasks on multiple GPUs. Hopefully this notebook provides useful examples to refer to while doing your own ETL & analytics work.\n",
+ "\n",
+ "For more info on what's working today with Dask and cuDF, see [our summary](https://docs.rapids.ai/api/cudf/stable/), and follow [our ongoing development](https://github.com/rapidsai/cudf).\n",
+ "\n",
+ "Also check out other [community contributed notebooks](https://github.com/rapidsai/notebooks-contrib), and submit your own!"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/conference_notebooks/TMLS_2020/notebooks/Taxi/Overview-Taxi.ipynb b/conference_notebooks/TMLS_2020/notebooks/Taxi/Overview-Taxi.ipynb
new file mode 100644
index 00000000..7db9dc7a
--- /dev/null
+++ b/conference_notebooks/TMLS_2020/notebooks/Taxi/Overview-Taxi.ipynb
@@ -0,0 +1,868 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Intro to RAPIDS using the New York City Yellow Taxi Data \n",
+ "Light on data science, heavy on comparisons.\n",
+ "\n",
+ "This notebook is for the Toronto Machine Learning Summit, Nov 16-29, 2020\n",
+ "\n",
+ "![TMLS](./img/TMLS.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This notebook includes\n",
+ "\n",
+ "* cudf - for basic ETL and some __statistical analysis__ \n",
+ "* cuml - for __machine learning__\n",
+ "* cugraph - for some __graph analysis__\n",
+ "* cuxfilter - for __visualization__\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "----\n",
+ "# Setup"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load the libraries\n",
+ "import cudf\n",
+ "\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import math\n",
+ "\n",
+ "import os\n",
+ "import gc\n",
+ "\n",
+ "from collections import OrderedDict\n",
+ "import argparse\n",
+ "import datetime\n",
+ "import time"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "try: \n",
+ "    import tqdm\n",
+ "except ModuleNotFoundError:\n",
+ "    os.system('pip install tqdm')\n",
+ "    import tqdm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Let's use Unified Memory (aka managed memory) to try to avoid OOM errors\n",
+ "# start by importing the RAPIDS Memory Manager and then reinitializing with managed memory turned on\n",
+ "import rmm\n",
+ "\n",
+ "rmm.reinitialize(\n",
+ "    managed_memory=True,  # Use managed memory, this allows for oversubscription of the GPU\n",
+ "    pool_allocator=False,  # default is False\n",
+ "    devices=0,  # GPU device IDs to register. By default, registers only GPU 0.\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Download the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "top_dir = \"./\"\n",
+ "data_dir = \"./nyctaxi\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Download Taxi data\n",
+ "\n",
+ "if not os.path.exists(data_dir):\n",
+ "    import nyctaxi_data\n",
+ "\n",
+ "    print(\"downloading data\")\n",
+ "    nyctaxi_data.download_nyctaxi_data([\"2016\"], top_dir)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "----\n",
+ "\n",
+ "# cuDF - Accelerated Data Frame "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# get a list of files\n",
+ "data_path = top_dir + \"nyctaxi/2016\"\n",
+ "\n",
+ "files = []\n",
+ "\n",
+ "for f in sorted(os.listdir(data_path)):\n",
+ "    if not f.startswith('yellow'):\n",
+ "        continue\n",
+ "\n",
+ "    fname = os.path.join(data_path, f)\n",
+ "\n",
+ "    files.append(fname)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "files"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!du -sh $data_path"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading data performance test"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def read_pandas(f):\n",
+ "    start_t = time.time()\n",
+ "    df = pd.read_csv(f)\n",
+ "    end_t = time.time() - start_t\n",
+ "\n",
+ "    return df, end_t"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def read_cudf(f):\n",
+ "    start_t = time.time()\n",
+ "    df = cudf.read_csv(f)\n",
+ "    end_t = time.time() - start_t\n",
+ "\n",
+ "    return df, end_t"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "_ = read_pandas(files[0])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load data with Pandas\n",
+ "\n",
+ "data = []\n",
+ "\n",
+ "start_t = time.time()\n",
+ "\n",
+ "for f in files:\n",
+ "    print(\"\\treading \" + f, end = '')\n",
+ "    df, t = read_pandas(f)\n",
+ "    print(\" ... in time of \" + str(t) + \" seconds\")\n",
+ "    data.append(df)\n",
+ "\n",
+ "taxi_pdf = pd.concat(data)\n",
+ "\n",
+ "end_t = time.time()\n",
+ "\n",
+ "print(f\"loaded {len(taxi_pdf):,} records in {(end_t - start_t):.2f} seconds\")\n",
+ "\n",
+ "del data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load data with RAPIDS cuDF\n",
+ "\n",
+ "data = []\n",
+ "\n",
+ "start_t = time.time()\n",
+ "\n",
+ "for f in files:\n",
+ "    print(\"\\treading \" + f, end = '')\n",
+ "    df, t = read_cudf(f)\n",
+ "    print(\" ... in time of \" + str(t) + \" seconds\")\n",
+ "    data.append(df)\n",
+ "\n",
+ "taxi_gdf = cudf.concat(data)\n",
+ "\n",
+ "end_t = time.time()\n",
+ "\n",
+ "print(f\"loaded {len(taxi_gdf):,} records in {(end_t - start_t):.2f} seconds\")\n",
+ "\n",
+ "del data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "taxi_gdf.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Sort Comparisons - Single Field"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "sp = taxi_pdf.sort_values(by='trip_distance',ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sp.head(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "sg = taxi_gdf.sort_values(by='trip_distance',ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sg.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Group By - Single Column "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "gbp = taxi_pdf.groupby('passenger_count').count()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gbp.head(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "gbg = taxi_gdf.groupby('passenger_count').count()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gbg.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Fun with Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "print(f\"Max fare was ${taxi_pdf['fare_amount'].max():,}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "print(f\"Max fare was ${taxi_gdf['fare_amount'].max():,}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# looking at that huge fare\n",
+ "maxf = taxi_gdf['fare_amount'].max()\n",
+ "taxi_gdf.query('fare_amount == @maxf')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"Farthest trip was {taxi_gdf['trip_distance'].max():,} miles\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# How long did it take to drive that distance?\n",
+ "maxd= taxi_gdf['trip_distance'].max()\n",
+ "taxi_gdf.query('trip_distance == @maxd')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Changing data types"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# change some data types\n",
+ "taxi_gdf = taxi_gdf.astype({'tpep_pickup_datetime':'datetime64[ms]', 'tpep_dropoff_datetime':'datetime64[ms]'})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Filtering data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# filter out records with missing or outlier values\n",
+ "query_frags = (\"(fare_amount > 0 and fare_amount < 500) \" +\n",
+ "               \"and (passenger_count > 0 and passenger_count < 6) \" +\n",
+ "               \"and (pickup_longitude > -75 and pickup_longitude < -73) \" +\n",
+ "               \"and (dropoff_longitude > -75 and dropoff_longitude < -73) \" +\n",
+ "               \"and (pickup_latitude > 40 and pickup_latitude < 42) \" +\n",
+ "               \"and (dropoff_latitude > 40 and dropoff_latitude < 42) \" +\n",
+ "               \"and (pickup_latitude != dropoff_latitude) \" +\n",
+ "               \"and (pickup_longitude != dropoff_longitude)\"\n",
+ "              )\n",
+ "\n",
+ "taxi_gdf = taxi_gdf.query(query_frags)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Add some new features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# easier to reference time by YYYY MM DD fields versus a timestamp\n",
+ "taxi_gdf['hour'] = taxi_gdf['tpep_pickup_datetime'].dt.hour\n",
+ "taxi_gdf['year'] = taxi_gdf['tpep_pickup_datetime'].dt.year\n",
+ "taxi_gdf['month'] = taxi_gdf['tpep_pickup_datetime'].dt.month\n",
+ "taxi_gdf['day'] = taxi_gdf['tpep_pickup_datetime'].dt.day\n",
+ "taxi_gdf['diff'] = taxi_gdf['tpep_dropoff_datetime'].astype('int64') - taxi_gdf['tpep_pickup_datetime'].astype('int64')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def day_of_the_week_kernel(day, month, year, day_of_week):\n",
+ "    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):\n",
+ "        if month[i] < 3:\n",
+ "            shift = 12  # Zeller's congruence counts Jan/Feb as months 13/14 of the previous year\n",
+ "        else:\n",
+ "            shift = 0\n",
+ "        Y = year[i] - (month[i] < 3)\n",
+ "        y = Y - 2000\n",
+ "        c = 20\n",
+ "        d = day[i]\n",
+ "        m = month[i] + shift + 1\n",
+ "        day_of_week[i] = (d + math.floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7\n",
+ "\n",
+ "taxi_gdf = taxi_gdf.apply_rows(\n",
+ "    day_of_the_week_kernel\n",
+ "    , incols = ['day', 'month', 'year']\n",
+ "    , outcols = {'day_of_week': np.int32}\n",
+ "    , kwargs = {}\n",
+ "    )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "taxi_gdf.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Basic Statistical Data Science\n",
+ "\n",
+ "### Look at some features by hour"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 1) Let's look at a plot of fare by hour\n",
+ "%matplotlib inline\n",
+ "taxi_gdf.groupby('hour').fare_amount.mean().to_pandas().sort_index().plot(legend=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 2) Tips by hour\n",
+ "%matplotlib inline\n",
+ "taxi_gdf.groupby('hour').tip_amount.mean().to_pandas().sort_index().plot(legend=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 3) Number of taxi rides by Hour\n",
+ "%matplotlib inline\n",
+ "taxi_gdf.groupby('hour').hour.count().to_pandas().sort_index().plot(legend=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Look at what days are the busiest\n",
+ "%matplotlib inline\n",
+ "taxi_gdf.groupby('day_of_week').day_of_week.count().to_pandas().sort_index().plot(legend=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# What days have the best tips\n",
+ "%matplotlib inline\n",
+ "taxi_gdf.groupby('day_of_week').tip_amount.mean().to_pandas().sort_index().plot(legend=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Dropping Columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "taxi_gdf = taxi_gdf.drop('store_and_fwd_flag', axis=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "taxi_gdf.dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# cuML - Accelerated Machine Learning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### In Corey's talk"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "# cuGraph - Accelerated Graph Analytics\n",
+ "\n",
+ "We need vertex IDs to be integer values, but what we have are lat-long pairs (float64). There are two ways to address this: the hard way and the easy way."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cugraph"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "taxi_subset = taxi_gdf[['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'trip_distance']].reset_index()\n",
+ "taxi_subset['count'] = 1\n",
+ "del taxi_gdf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Create vertices and edges the hard way"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create node IDs from lat-long combinations\n",
+ "nodes = [\n",
+ " taxi_subset[['pickup_longitude', 'pickup_latitude']].drop_duplicates().rename(columns={'pickup_longitude': 'long', 'pickup_latitude': 'lat'})\n",
+ " , taxi_subset[['dropoff_longitude', 'dropoff_latitude']].drop_duplicates().rename(columns={'dropoff_longitude': 'long', 'dropoff_latitude': 'lat'})\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "nodes = cudf.concat(nodes).drop_duplicates().reset_index(drop=True).reset_index().rename(columns={'index': 'id'})\n",
+ "nodes.head(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print('Total number of geo points in the dataset: {0:,}'.format(len(nodes)))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "edges = (\n",
+ " taxi_subset[['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'trip_distance']]\n",
+ " .drop_duplicates()\n",
+ " .rename(columns={'pickup_longitude': 'long', 'pickup_latitude': 'lat'})\n",
+ " .merge(nodes, on=['lat', 'long'])\n",
+ " .rename(columns={'long': 'pickup_longitude', 'lat': 'pickup_latitude', 'id': 'pickup_id', 'dropoff_longitude': 'long', 'dropoff_latitude': 'lat'})\n",
+ " .merge(nodes, on=['lat', 'long'])\n",
+ " .rename(columns={'long': 'dropoff_longitude', 'lat': 'dropoff_latitude', 'id': 'dropoff_id'})\n",
+ ")[['pickup_id', 'dropoff_id', 'trip_distance']]\n",
+ "\n",
+ "edges.head(5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "len(edges)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "g = cugraph.Graph()\n",
+ "g.from_cudf_edgelist(edges, source='pickup_id', destination='dropoff_id')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Pagerank"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "page = cugraph.pagerank(g, alpha=.85, max_iter=1000, tol=1.0e-05)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "page.sort_values(by='pagerank', ascending=False).head(5).to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Now the easy way"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "g2 = cugraph.Graph()\n",
+ "g2.from_cudf_edgelist(taxi_subset, \n",
+ " source=['pickup_longitude', 'pickup_latitude'], \n",
+ " destination=['dropoff_longitude', 'dropoff_latitude'], \n",
+ " edge_attr='count',\n",
+ " renumber=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "page = cugraph.pagerank(g2, alpha=.85, max_iter=1000, tol=1.0e-05)\n",
+ "page.sort_values(by='pagerank', ascending=False).head(5).to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "cugraph_dev",
+ "language": "python",
+ "name": "cugraph_dev"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/conference_notebooks/TMLS_2020/notebooks/Taxi/img/TMLS.png b/conference_notebooks/TMLS_2020/notebooks/Taxi/img/TMLS.png
new file mode 100644
index 00000000..44a8234f
Binary files /dev/null and b/conference_notebooks/TMLS_2020/notebooks/Taxi/img/TMLS.png differ
diff --git a/conference_notebooks/KDD_2020/notebooks/Taxi/nyctaxi_data.py b/conference_notebooks/TMLS_2020/notebooks/Taxi/nyctaxi_data.py
similarity index 100%
rename from conference_notebooks/KDD_2020/notebooks/Taxi/nyctaxi_data.py
rename to conference_notebooks/TMLS_2020/notebooks/Taxi/nyctaxi_data.py
diff --git a/getting_started_materials/README.md b/getting_started_materials/README.md
new file mode 100644
index 00000000..212fcd1c
--- /dev/null
+++ b/getting_started_materials/README.md
@@ -0,0 +1,199 @@
+# **Intro to RAPIDS Course for Content Creators**
+## Introduction
+
+In this intro course, we cover the basic skills you need to accelerate your data analytics and ML pipeline with RAPIDS. Get to know the RAPIDS core libraries: cuDF, cuML, cuGraph, and cuXFilter, as well as community libraries, including: XGBoost, Dask, and BlazingSQL, to accelerate how you:
+- Ingest data
+- Prepare your data with ETL
+- Run modeling, inferencing, and predicting algorithms on the data in a GPU dataframe
+- Visualize your data throughout the process.
+
+Each of the three modules should take less than 2 hours to complete. When complete, you should be able to:
+1. Take an existing workflow in a data science or ML pipeline and use RAPIDS to accelerate it with your GPU
+1. Create your own workflows from scratch
+
+This course was written with the expectation that you know Python and Jupyter Lab. It is helpful, but not necessary, to have at least some understanding of Pandas, Scikit Learn, NetworkX, and Datashader.
+
+[You should be able to run these exercises and use these libraries on any machine with these prerequisites](https://rapids.ai/start.html#PREREQUISITES), namely:
+- OS of Ubuntu 16.04 or 18.04 or CentOS7 with gcc 5.4 & 7.3
+- an NVIDIA GPU of Pascal Architecture or better (basically 10xx series or newer)
+
+RAPIDS works on a broad range of GPUs, including NVIDIA GeForce, TITAN, Quadro, Tesla, A100, and DGX systems.
+## NVIDIA Titan RTX
+- [NVIDIA Spot on Titan RTX and RAPIDS](https://www.youtube.com/watch?v=tsWPeZTLpkU)
+- [t-SNE 600x Speed up on Titan RTX](https://www.youtube.com/watch?v=_4OehmMYr44)
+
+
+
+## Questions?
+There are a few channels to ask questions or start a discussion:
+- [GoAI Slack](https://join.slack.com/t/rapids-goai/shared_invite/enQtMjE0Njg5NDQ1MDQxLTJiN2FkNTFkYmQ2YjY1OGI4NTc5Y2NlODQ3ZDdiODEwYmRiNTFhMzNlNTU5ZWJhZjA3NTg4NDZkMThkNTkxMGQ) to discuss issues and troubleshoot with the RAPIDS community
+- [RAPIDS GitHub](https://github.com/rapidsai) to submit feature requests and report bugs
+
+# **Getting Started**
+There are 3 steps to installing RAPIDS:
+1. Provisioning a GPU-enabled workspace
+1. Installing RAPIDS Prerequisites
+1. Installing RAPIDS libraries
+
+## 1. Provisioning a GPU-Enabled Workspace
+When installing RAPIDS, first provision a RAPIDS-compatible GPU. The GPU must be **NVIDIA Pascal™ or better with compute capability 6.0+**. Here is a list of compatible GPUs. This GPU can be local, as in a workstation or server, or in the cloud. GPUs can reside in:
+- Shared cloud
+- Dedicated cloud
+- Local workspace
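+
+As a rough way to check the compute capability requirement above, the sketch below compares a `(major, minor)` pair against the 6.0 minimum. The helper names are hypothetical, the optional local query assumes the `pynvml` package and an NVIDIA driver are present, and the capability values are NVIDIA's published figures for the GPUs mentioned in this guide.
+
+```python
+# Minimum compute capability RAPIDS requires (Pascal or newer).
+MIN_CAPABILITY = (6, 0)
+
+# Published compute capabilities for GPUs named in this guide.
+KNOWN_GPUS = {
+    "K80": (3, 7),   # Kepler: below the minimum
+    "P100": (6, 0),  # Pascal
+    "P4": (6, 1),    # Pascal
+    "T4": (7, 5),    # Turing
+}
+
+
+def is_rapids_compatible(capability):
+    """Return True if a (major, minor) capability meets the 6.0 minimum."""
+    return tuple(capability) >= MIN_CAPABILITY
+
+
+def local_gpu_capability(index=0):
+    """Query a local GPU's capability; needs pynvml and an NVIDIA driver."""
+    import pynvml  # imported lazily so the rest of the sketch runs without it
+
+    pynvml.nvmlInit()
+    try:
+        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
+        return tuple(pynvml.nvmlDeviceGetCudaComputeCapability(handle))
+    finally:
+        pynvml.nvmlShutdown()
+
+
+for name, capability in KNOWN_GPUS.items():
+    status = "supported" if is_rapids_compatible(capability) else "not supported"
+    print(f"{name}: compute capability {capability[0]}.{capability[1]} -> {status}")
+```
+
+On a machine with a GPU, `is_rapids_compatible(local_gpu_capability())` gives the same answer for the installed card.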
+
+### Using Cloud Instance(s)
+There are two options for using Cloud Instances:
+1. Shared, **free** instances like app.blazingsql.com and Google Colab
+1. Dedicated, **paid** [usually] [GPU instances from providers like AWS, Azure, GCP, Paperspace, and more](https://rapids.ai/cloud.html)
+
+### Shared Cloud via Free Instances
+Free cloud instances have quick start capabilities or scripts to ease onboarding.
+- **Google Colab**: The installation takes about 8 minutes. First, select a GPU instance under Runtime type. Then copy and paste the provided RAPIDS installation scripts, found here, into a code cell. Please note that RAPIDS will not run on an unsupported GPU instance like the K80; it runs ONLY on the T4, P4, and P100 (check with `!nvidia-smi`). If you are given a K80, please factory reset your instance and then check again.
+- **app.blazingsql.com**: these instances are preloaded with RAPIDS and you can start right away
+
+### Dedicated Cloud via Paid Instances
+There are several ways to provision a dedicated cloud GPU workspace, and our instructions are found here. Your OS will need to be **Ubuntu or RHEL/CentOS 7**. These instances follow the same RAPIDS installation process as a local instance.
+
+## 2. Installing RAPIDS Prerequisites
+### Downloads
+You can satisfy the prerequisites for installing RAPIDS by:
+1. Installing the OS and GPU drivers
+1. Installing a packaging environment (Docker or Conda)
+
+### OS and GPU Drivers
+Please ensure that your workstation meets the following prerequisites:
+- GPU: NVIDIA Pascal™ or better with compute capability 6.0+ (completed above)
+- OS: Ubuntu 16.04/18.04 or CentOS 7 with gcc/g++ 7.5+
+ - See RSN 1 for details on our recent update to gcc/g++ 7.5
+ - RHEL 7 support is provided through CentOS 7 builds/installs
+- CUDA & NVIDIA Drivers: One of the following supported versions:
+ - 10.0 & v410.48+ (valid option for version 0.14 and earlier only)
+ - 10.1.2 & v418.87+
+ - 10.2 & v440.33+
+ - 11.0 (valid option for version 0.16 and later)
+- Python
+ - 3.6 (valid option for version 0.14 and earlier)
+ - 3.7
+ - 3.8 (valid option for version 0.16 and later)
+
+
+### Install Packaging Environment (Docker or Conda)
+Depending on whether you prefer to use RAPIDS with Docker or Conda, you will also need the following installed:
+
+- If Docker: Docker CE v19.03+ and nvidia-container-toolkit
+ - Legacy Support - Docker CE v17-18 and nvidia-docker2
+
+- If Conda, please install
+ - [Miniconda](https://conda.io/miniconda.html) for a minimal conda installation
+ - [Anaconda](https://www.anaconda.com/download) for full conda installation
+ - [Mamba inside of conda](https://github.com/TheSnakePit/mamba) for a faster conda solving (untested)
+
+## 3. Installing RAPIDS Libraries
+
+- Use the [Interactive RAPIDS release selector](https://rapids.ai/start.html#rapids-release-selector) to install RAPIDS as you want it. The install script at the bottom will update as you change your install parameters of **method, desired RAPIDS release, desired RAPIDS packages, Linux version, and CUDA version**.
+
+#
+Great! Now that you're done getting up and running, let's move on to the Data Science!
+
+## **1. The Basics of RAPIDS: cuDF and Dask**
+### Introduction
+cuDF lets you create and manipulate your dataframes on GPUs. All other RAPIDS libraries use cuDF to model, infer, regress, reduce, and predict outcomes. The cuDF API is designed to be similar to Pandas with minimal code changes.
+- [latest RAPIDS cuDF documentation](https://docs.rapids.ai/api)
+- [RAPIDS cuDF GitHub repo](https://github.com/rapidsai/cudf)
+
+There are situations where the dataframe is larger than available GPU memory. Dask helps RAPIDS algorithms scale through distributed computing. Whether you have a single GPU, multiple GPUs, or clusters of multiple GPUs, Dask handles the distributed computation and orchestrates the processing of GPU dataframes, no matter their size, just as it does on a regular CPU cluster.
+
+Let's get started with a couple videos!
+
+### Videos
+
+| Video Title | Description |
+|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Video - Getting Started with RAPIDS](https://www.youtube.com/watch?v=T2AU0iVbY5A) | Walks through the [01_Introduction_to_RAPIDS](intro_tutorials_and_guides/01_Introduction_to_RAPIDS.ipynb) notebook, which shows, at a high level, what each of the packages in RAPIDS is and what it does. |
+| [Video - RAPIDS: Dask and cuDF NYCTaxi Screencast](https://www.youtube.com/watch?v=gV0cykgsTPM) | Shows how you can use RAPIDS and Dask to easily ingest and model a large dataset (1 year's worth of NYCTaxi data) and then create a model around the question "when do you get the best tips". This same workload can be done on any GPU. |
+
+### Learning Notebooks
+
+
+| Notebook Title | Description |
+|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [01_Introduction_to_RAPIDS](intro_tutorials_and_guides/01_Introduction_to_RAPIDS.ipynb) | This notebook shows at a high level what each of the packages in RAPIDS are as well as what they do. |
+| [02_Introduction_to_cuDF](intro_tutorials_and_guides/02_Introduction_to_cuDF.ipynb) | This notebook shows how to work with cuDF DataFrames in RAPIDS. |
+| [03_Introduction_to_Dask](intro_tutorials_and_guides/03_Introduction_to_Dask.ipynb) | This notebook shows how to work with Dask using basic Python primitives like integers and strings. |
+| [04_Introduction_to_Dask_using_cuDF_DataFrames](intro_tutorials_and_guides/04_Introduction_to_Dask_using_cuDF_DataFrames.ipynb) | This notebook shows how to work with cuDF DataFrames using Dask. |
+| [Guide to UDFs](https://github.com/rapidsai/cudf/blob/branch-0.18/docs/cudf/source/guide-to-udfs.ipynb) | This notebook provides an overview of User Defined Functions with cuDF. |
+
+
+
+### Extra credit and Exercises
+- [10 minute review of cuDF](https://github.com/rapidsai/cudf/blob/branch-0.18/docs/cudf/source/10min.ipynb)
+- [Extra Credit - 10 minute guide to cuDF and CuPy](https://github.com/rapidsai/cudf/blob/branch-0.18/docs/cudf/source/10min-cudf-cupy.ipynb)
+- [Extra Credit - Multi-GPU with Dask-cuDF](https://rapidsai.github.io/projects/cudf/en/0.18.0/dask-cudf.html)
+- [Review and Exercises 1- Review of cuDF](../the_archive/archived_rapids_event_notebooks/SCIPY_2019/cudf/01-Intro_to_cuDF.ipynb)
+- [Review and Exercises 2- Creating User Defined Functions (UDFs) in cuDF](../the_archive/archived_rapids_event_notebooks/SCIPY_2019/cudf/02-Intro_to_cuDF_UDFs.ipynb)
+
+## **2. Accelerating those Algorithms: cuML and XGBoost**
+### Introduction
+Congrats on learning the basics of cuDF and Dask. Now let's take a look at cuML.
+
+cuML runs many common scikit-learn algorithms and methods on cuDF dataframes to model, infer, regress, reduce, and predict outcomes on the data. [Among the ever-growing suite of algorithms, you can run several GPU-accelerated algorithms for each of these methods:]()
+
+- Classification / Regression
+- Inference
+- Clustering
+- Decomposition & Dimensionality Reduction
+- Time Series
+
+While we look at cuML, we'll also see how to further increase your speedup with [XGBoost](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/), scale it out with Dask XGBoost, and then use cuML for Dimensionality Reduction and Clustering.
+- [latest RAPIDS cuML documentation](https://docs.rapids.ai/api)
+- [RAPIDS cuML GitHub repo](https://github.com/rapidsai/cuml)
+
+Let's look at a few video walkthroughs of XGBoost, as it may be an unfamiliar concept to some, and then experience how to use the above in your learning notebooks.
+
+### Videos
+
+| Video Title | Description |
+|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Video - Introduction to XGBoost](https://www.youtube.com/watch?v=EQR3bP6XFW0) | Walks through the [07_Introduction_to_XGBoost](intro_tutorials_and_guides/07_Introduction_to_XGBoost.ipynb) notebook and shows how to work with GPU accelerated XGBoost in RAPIDS. |
+| [Video - Introduction to Dask XGBoost](https://www.youtube.com/watch?v=q8HfEZythjM) | Walks through the [08_Introduction_to_Dask_XGBoost](getting_started_notebooks/intro_tutorials/08_Introduction_to_Dask_XGBoost.ipynb) notebook and shows how to work with Dask XGBoost in RAPIDS. This can be run on a single GPU as well and is useful when your dataset is larger than the memory size of your GPU. Will be deprecated in 0.15 and removed in 0.16. |
+
+### Learning Notebooks
+
+| Notebook Title | Description |
+|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [06_Introduction_to_Supervised_Learning](intro_tutorials_and_guides/06_Introduction_to_Supervised_Learning.ipynb) | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. |
+| [07_Introduction_to_XGBoost](intro_tutorials_and_guides/07_Introduction_to_XGBoost.ipynb) | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. |
+| [09_Introduction_to_Dimensionality_Reduction](intro_tutorials_and_guides/09_Introduction_to_Dimensionality_Reduction.ipynb) | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. |
+| [10_Introduction_to_Clustering](intro_tutorials_and_guides/10_Introduction_to_Clustering.ipynb) | This notebook shows how to do GPU accelerated Clustering in RAPIDS. |
+
+
+### Extra credit and Exercises
+
+- [10 Review of cuML Estimators](https://github.com/rapidsai/cuml/blob/branch-0.18/docs/source/estimator_intro.ipynb)
+
+- [Review and Exercises 1 - Linear Regression](../the_archive/archived_rapids_event_notebooks/SCIPY_2019/cuml/01-Introduction-LinearRegression-Hyperparam.ipynb)
+
+- [Review and Exercises 2 - Logistic Regression](../the_archive/archived_rapids_event_notebooks/SCIPY_2019/cuml/02-LogisticRegression.ipynb)
+
+- [Review and Exercises 3- Intro to UMAP](../the_archive/archived_rapids_event_notebooks/SCIPY_2019/cuml/03-UMAP.ipynb)
+
+### RAPIDS cuML Example Notebooks
+- [Index of Notebooks](https://github.com/rapidsai/notebooks#cuml-notebooks)
+- [Direct Link to Notebooks](https://github.com/rapidsai/notebooks/tree/branch-0.18/cuml)
+
+
+### Conclusion to Sections 1 and 2
+Here ends the basics of cuDF, cuML, Dask, and XGBoost. These are the libraries that everyone who uses RAPIDS will reach for every day. The next section covers libraries that are more niche in usage but powerful for your analytics.
+
+## **3. Graphs on RAPIDS: Intro to cuGraph**
+
+It is often useful to look at the relationships contained in the data, which we do through the use of graph analytics. Representing data as a graph is an extremely powerful technique that has grown in popularity. Graph analytics helps Netflix recommend shows, helps Google rank sites in its search engine, connects bits of discrete knowledge into a comprehensive corpus, schedules NFL games, and can even help you optimize seating for your wedding (and it works too!). [KDnuggets has a great in-depth guide to graphs here](https://www.kdnuggets.com/2017/12/graph-analytics-using-big-data.html). Up until now, running graph analytics was painfully slow, particularly as the size of the graph (number of nodes and edges) grew.
+
+[RAPIDS' cuGraph library makes graph analytics effortless, as it boasts some of our best speedups](https://www.zdnet.com/article/nvidia-rapids-cugraph-making-graph-analysis-ubiquitous/) (up to 25,000x). To put it in perspective, what can take over 20 hours, cuGraph lets you do in less than a minute (3 seconds). In this section, we'll look at some examples of cuGraph methods for your graph analytics and look at a simple use case.
+- [latest RAPIDS cuGraph documentation](https://docs.rapids.ai/api)
+- [RAPIDS cuGraph GitHub repo](https://github.com/rapidsai/cugraph)
+
+### RAPIDS cuGraph Example Notebooks
+- [Index of Notebooks](https://github.com/rapidsai/notebooks/#cugraph-notebooks)
+- [Direct Link to Notebooks](https://github.com/rapidsai/notebooks/tree/branch-0.18/cugraph)
\ No newline at end of file
diff --git a/getting_started_notebooks/basics/Dask_Hello_World.ipynb b/getting_started_materials/hello_worlds/Dask_Hello_World.ipynb
similarity index 100%
rename from getting_started_notebooks/basics/Dask_Hello_World.ipynb
rename to getting_started_materials/hello_worlds/Dask_Hello_World.ipynb
diff --git a/getting_started_notebooks/basics/blazingsql/README.md b/getting_started_materials/hello_worlds/blazingsql/README.md
similarity index 100%
rename from getting_started_notebooks/basics/blazingsql/README.md
rename to getting_started_materials/hello_worlds/blazingsql/README.md
diff --git a/getting_started_notebooks/basics/blazingsql/federated_query_demo.ipynb b/getting_started_materials/hello_worlds/blazingsql/federated_query_demo.ipynb
similarity index 100%
rename from getting_started_notebooks/basics/blazingsql/federated_query_demo.ipynb
rename to getting_started_materials/hello_worlds/blazingsql/federated_query_demo.ipynb
diff --git a/getting_started_notebooks/basics/blazingsql/getting_started_with_blazingsql.ipynb b/getting_started_materials/hello_worlds/blazingsql/getting_started_with_blazingsql.ipynb
similarity index 100%
rename from getting_started_notebooks/basics/blazingsql/getting_started_with_blazingsql.ipynb
rename to getting_started_materials/hello_worlds/blazingsql/getting_started_with_blazingsql.ipynb
diff --git a/getting_started_notebooks/basics/hello_streamz.ipynb b/getting_started_materials/hello_worlds/hello_streamz.ipynb
similarity index 100%
rename from getting_started_notebooks/basics/hello_streamz.ipynb
rename to getting_started_materials/hello_worlds/hello_streamz.ipynb
diff --git a/getting_started_materials/intro_tutorials_and_guides/01_Introduction_to_RAPIDS.ipynb b/getting_started_materials/intro_tutorials_and_guides/01_Introduction_to_RAPIDS.ipynb
new file mode 100644
index 00000000..f08767b4
--- /dev/null
+++ b/getting_started_materials/intro_tutorials_and_guides/01_Introduction_to_RAPIDS.ipynb
@@ -0,0 +1,1101 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Introduction to RAPIDS\n",
+ "#### By Paul Hendricks\n",
+ "-------\n",
+ "\n",
+ "While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal. \n",
+ "\n",
+ "NVIDIA created RAPIDS – an open-source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations. RAPIDS is based on Python, has Pandas-like and Scikit-Learn-like interfaces, is built on Apache Arrow in-memory data format, and can scale from 1 to multi-GPU to multi-nodes. RAPIDS integrates easily into the world’s most popular data science Python-based workflows. RAPIDS accelerates data science end-to-end – from data prep, to machine learning, to deep learning. And through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.\n",
+ "\n",
+ "In this notebook, we will discuss and show at a high level what each of the packages in RAPIDS is as well as what it does. Subsequent notebooks will dive deeper into the various areas of data science and machine learning and show how you can use RAPIDS to accelerate your workflow in each of these areas.\n",
+ "\n",
+ "**Table of Contents**\n",
+ "\n",
+ "* [Introduction to RAPIDS](#introduction)\n",
+ "* [Setup](#setup)\n",
+ "* [Pandas](#pandas)\n",
+ "* [cuDF](#cudf)\n",
+ "* [Scikit-Learn](#scikitlearn)\n",
+ "* [cuML](#cuml)\n",
+ "* [Dask](#dask)\n",
+ "* [Dask cuDF](#daskcudf)\n",
+ "* [Conclusion](#conclusion)\n",
+ "\n",
+ "Before going any further, let's make sure we have access to `matplotlib`, a popular Python library for visualizing data. The Conda install of RAPIDS no longer includes it by default, but the Docker install does."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "try:\n",
+ " import matplotlib\n",
+ "except ModuleNotFoundError:\n",
+ " os.system('conda install -y matplotlib')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Setup\n",
+ "\n",
+ "This notebook was tested using the [RAPIDS Stable Conda channel, versions 0.17 and 0.18](https://anaconda.org/rapidsai/rapids), and the following Docker containers:\n",
+ "\n",
+ "* `rapidsai/rapidsai-dev:0.18-cuda10.2-devel-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai)\n",
+ "\n",
+ "This notebook was run on the NVIDIA GV100 GPU, the Quadro RTX 8000, and the T4. Please be aware that your system may be different and you may need to modify the code or install packages to run the below examples. \n",
+ "\n",
+ "If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues\n",
+ "\n",
+ "Before we begin, let's check out our hardware setup by running the `nvidia-smi` command."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tue Apr 6 13:15:36 2021 \n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |\n",
+ "|-------------------------------+----------------------+----------------------+\n",
+ "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
+ "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
+ "|===============================+======================+======================|\n",
+ "| 0 Quadro RTX 8000 On | 00000000:42:00.0 Off | Off |\n",
+ "| 33% 36C P8 41W / 260W | 34515MiB / 48601MiB | 0% Default |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ "| 1 Quadro RTX 8000 On | 00000000:43:00.0 Off | Off |\n",
+ "| 33% 42C P8 42W / 260W | 211MiB / 48598MiB | 0% Default |\n",
+ "+-------------------------------+----------------------+----------------------+\n",
+ " \n",
+ "+-----------------------------------------------------------------------------+\n",
+ "| Processes: GPU Memory |\n",
+ "| GPU PID Type Process name Usage |\n",
+ "|=============================================================================|\n",
+ "| 0 4987 C .../miniconda3/envs/rapids-0.16/bin/python 22299MiB |\n",
+ "| 0 23869 C .../miniconda3/envs/rapids-0.18/bin/python 721MiB |\n",
+ "| 0 89935 C ...an/miniconda3/envs/0.17-test/bin/python 11483MiB |\n",
+ "| 1 2156 G /usr/lib/xorg/Xorg 39MiB |\n",
+ "| 1 2313 G /usr/bin/gnome-shell 85MiB |\n",
+ "| 1 22643 G /usr/lib/xorg/Xorg 64MiB |\n",
+ "| 1 22689 G /usr/bin/gnome-shell 8MiB |\n",
+ "+-----------------------------------------------------------------------------+\n"
+ ]
+ }
+ ],
+ "source": [
+ "!nvidia-smi"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, let's see what CUDA version we have. If it's not found, that's okay; you may not have `nvcc` installed or may not be running inside a Docker container."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "nvcc: NVIDIA (R) Cuda compiler driver\n",
+ "Copyright (c) 2005-2017 NVIDIA Corporation\n",
+ "Built on Fri_Nov__3_21:07:56_CDT_2017\n",
+ "Cuda compilation tools, release 9.1, V9.1.85\n"
+ ]
+ }
+ ],
+ "source": [
+ "!nvcc --version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, let's load some helper functions from `matplotlib` and configure the Jupyter Notebook for visualization."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from matplotlib.colors import ListedColormap\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's see how much GPU memory is available. Since this is a tutorial, we want to keep that data as big as possible without you running out of memory (OOM)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "your GPU has 48 GB\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pynvml.smi import nvidia_smi\n",
+ "nvsmi = nvidia_smi.getInstance()\n",
+ "gpus = nvsmi.DeviceQuery()\n",
+ "\n",
+ "gpu_mem = int(gpus['gpu'][0]['fb_memory_usage']['total']/1000) #gets your memory size of your first found GPU in GB\n",
+ "print(\"your GPU has\", gpu_mem, \"GB\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Pandas\n",
+ "\n",
+ "Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data - as the name suggests - comes in a structured form, often represented by a table or CSV. We'll focus the majority of these tutorials on working with these types of data.\n",
+ "\n",
+ "There exist many tools in the Python ecosystem for working with structured, tabular data but few are as widely used as Pandas. Pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing and many more. \n",
+ "\n",
+ "For more information on Pandas, check out the excellent documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
+ "\n",
+ "Below we show how to create a Pandas DataFrame, an internal object for representing tabular data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Pandas Version: 1.1.4\n",
+ " key value\n",
+ "0 0 10.0\n",
+ "1 0 11.0\n",
+ "2 2 12.0\n",
+ "3 2 13.0\n",
+ "4 3 14.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pandas as pd; print('Pandas Version:', pd.__version__)\n",
+ "\n",
+ "\n",
+ "# here we create a Pandas DataFrame with\n",
+ "# two columns named \"key\" and \"value\"\n",
+ "df = pd.DataFrame()\n",
+ "df['key'] = [0, 0, 2, 2, 3]\n",
+ "df['value'] = [float(i + 10) for i in range(5)]\n",
+ "print(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can perform many operations on this data. For example, let's say we wanted to sum all values in the `value` column. We could accomplish this using the following syntax:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "60.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "aggregation = df['value'].sum()\n",
+ "print(aggregation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## cuDF\n",
+ "\n",
+ "Pandas is fantastic for working with small datasets that fit into your system's memory. However, datasets are growing larger and data scientists are working with increasingly complex workloads - the need for accelerated compute arises.\n",
+ "\n",
+ "cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing Pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.\n",
+ "\n",
+ "Below, we show how to create a cuDF DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cuDF Version: 0.17.0a+382.gbd321d1e93\n",
+ " key value\n",
+ "0 0 10.0\n",
+ "1 0 11.0\n",
+ "2 2 12.0\n",
+ "3 2 13.0\n",
+ "4 3 14.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "import cudf; print('cuDF Version:', cudf.__version__)\n",
+ "\n",
+ "\n",
+ "# here we create a cuDF DataFrame with\n",
+ "# two columns named \"key\" and \"value\"\n",
+ "df = cudf.DataFrame()\n",
+ "df['key'] = [0, 0, 2, 2, 3]\n",
+ "df['value'] = [float(i + 10) for i in range(5)]\n",
+ "print(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As before, we can take this cuDF DataFrame and perform a `sum` operation over the `value` column. The key difference is that any operations we perform using cuDF use the GPU instead of the CPU."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "60.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "aggregation = df['value'].sum()\n",
+ "print(aggregation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note how the syntax for both creating and manipulating a cuDF DataFrame is identical to the syntax necessary to create and manipulate Pandas DataFrames; the cuDF API is based on the Pandas API. This design choice minimizes the cognitive burden of switching from a CPU based workflow to a GPU based workflow and allows data scientists to focus on solving problems while benefitting from the speed of a GPU!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Scikit-Learn\n",
+ "\n",
+ "After our data has been preprocessed, we often want to build a model so as to understand the relationships between different variables in our data. Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data. Below we show a simple example of how to create a Linear Regression model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "NumPy Version: 1.19.5\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np; print('NumPy Version:', np.__version__)\n",
+ "\n",
+ "\n",
+ "# create the relationship: y = 2.0 * x + 1.0\n",
+ "if gpu_mem <= 16:\n",
+ "    n_rows = 35000  # use 35 thousand data points; very small GPU memory sizes may require reducing this number further\n",
+ "else:\n",
+ "    n_rows = 100000  # use 100 thousand data points\n",
+ "w = 2.0\n",
+ "x = np.random.normal(loc=0, scale=1, size=(n_rows,))\n",
+ "b = 1.0\n",
+ "y = w * x + b\n",
+ "\n",
+ "# add a bit of noise\n",
+ "noise = np.random.normal(loc=0, scale=2, size=(n_rows,))\n",
+ "y_noisy = y + noise"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now visualize our data using the `matplotlib` library."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(x, y_noisy, label='empirical data points')\n",
+ "plt.plot(x, y, color='black', label='true relationship')\n",
+ "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## cuML\n",
+ "\n",
+ "The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. cuML makes it easy to build machine learning models in an accelerated fashion while still using an interface nearly identical to Scikit-Learn. The example below shows how to build the same Linear Regression model on a GPU.\n",
+ "\n",
+ "First, let's convert our data from a NumPy representation to a cuDF representation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " x y\n",
+ "0 -0.628841 -1.821668\n",
+ "1 1.490274 7.164684\n",
+ "2 -1.108334 -2.714711\n",
+ "3 -0.270642 -0.874697\n",
+ "4 1.600833 -1.727782\n"
+ ]
+ }
+ ],
+ "source": [
+ "# create a cuDF DataFrame\n",
+ "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n",
+ "print(df.head())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cuML Version: 0.17.0a+173.g2c0aacf44\n"
+ ]
+ }
+ ],
+ "source": [
+ "import cuml; print('cuML Version:', cuml.__version__)\n",
+ "from cuml.linear_model import LinearRegression as LinearRegression_GPU\n",
+ "\n",
+ "\n",
+ "# instantiate and fit model\n",
+ "linear_regression_gpu = LinearRegression_GPU()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/internals/api_decorators.py:410: UserWarning: Changing solver from 'eig' to 'svd' as eig solver does not support training data with 1 column currently.\n",
+ " return func(*args, **kwargs)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 415 ms, sys: 208 ms, total: 623 ms\n",
+ "Wall time: 2.48 s\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression(algorithm='eig', fit_intercept=True, normalize=False, handle=, verbose=4, output_type='input')"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "linear_regression_gpu.fit(df[['x']], df['y'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create new data and perform inference\n",
+ "new_data_df = cudf.DataFrame({'inputs': inputs})\n",
+ "outputs_gpu = linear_regression_gpu.predict(new_data_df[['inputs']])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lastly, we can overlay the relationship predicted by our GPU accelerated Linear Regression model (green line) on our empirical data points (light blue circles), the true relationship (black line), and the relationship predicted by a model built on the CPU (red line). We see that our GPU accelerated model's estimate of the true relationship (green line) is identical to the CPU based model's estimate of the true relationship (red line)!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(x, y_noisy, label='empirical data points')\n",
+ "plt.plot(x, y, color='black', label='true relationship')\n",
+ "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
+ "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='predicted relationship (gpu)')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Dask\n",
+ "\n",
+ "Dask is a library that facilitates distributed computing. Written in Python, it allows one to compose complex workflows using basic Python primitives like integers or strings as well as large data structures like those found in NumPy, Pandas, and cuDF. In the following examples and notebooks, we'll show how to use Dask with cuDF to accelerate common ETL tasks and train machine learning models like Linear Regression and XGBoost.\n",
+ "\n",
+ "To learn more about Dask, check out the documentation here: http://docs.dask.org/en/latest/\n",
+ "\n",
+ "#### Client/Workers\n",
+ "\n",
+ "Dask operates by creating a cluster composed of a \"client\" and multiple \"workers\". The client is responsible for scheduling work; the workers are responsible for actually executing that work. \n",
+ "\n",
+ "Typically, we set the number of workers to be equal to the number of computing resources we have available to us. For CPU based workflows, this might be the number of cores or threads on that particular machine. For example, we might set `n_workers = 8` if we have 8 CPU cores or threads on our machine that can each operate in parallel. This allows us to take advantage of all of our computing resources and enjoy the most benefits from parallelization.\n",
+ "\n",
+ "To get started, we'll create a local cluster of workers and client to interact with that cluster."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dask Version: 2.30.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "import dask; print('Dask Version:', dask.__version__)\n",
+ "from dask.distributed import Client, LocalCluster\n",
+ "\n",
+ "\n",
+ "# create a local cluster with `n_workers` workers (the text below assumes 4)\n",
+ "n_workers = 1\n",
+ "cluster = LocalCluster(n_workers=n_workers)\n",
+ "client = Client(cluster)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's inspect the `client` object to view our current Dask status. We should see the IP Address for our Scheduler as well as the number of workers in our Cluster. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# show current Dask status\n",
+ "client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can also see the status and more information at the Dashboard, found at `http:///status`. You can ignore this for now; we'll dive into it in subsequent tutorials.\n",
+ "\n",
+ "With our client and cluster of workers set up, it's time to execute our first distributed program. We'll define a function called `sleep_1` that sleeps for 1 second and returns the string \"Success!\". Executed serially four times, this function should take around 4 seconds."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "\n",
+ "def sleep_1():\n",
+ " time.sleep(1)\n",
+ " return 'Success!'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 7.64 ms, sys: 7.32 ms, total: 15 ms\n",
+ "Wall time: 1 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "for _ in range(n_workers):\n",
+ " sleep_1()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As expected, our workflow takes about 4 seconds to run. Now let's execute this same workflow in distributed fashion using Dask."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from dask.delayed import delayed"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Success!']\n",
+ "CPU times: user 9.37 ms, sys: 11.2 ms, total: 20.6 ms\n",
+ "Wall time: 1.01 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "# define delayed execution graph\n",
+ "sleep_operations = [delayed(sleep_1)() for _ in range(n_workers)]\n",
+ "\n",
+ "# use client to perform computations using execution graph\n",
+ "sleep_futures = client.compute(sleep_operations, optimize_graph=False, fifo_timeout=\"0ms\")\n",
+ "\n",
+ "# collect and print results\n",
+ "sleep_results = client.gather(sleep_futures)\n",
+ "print(sleep_results)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using Dask, we see that this whole workflow takes a little over a second - each worker is truly executing in parallel!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Dask cuDF\n",
+ "\n",
+ "In the previous example, we saw how we can use Dask with very basic objects to compose a graph that can be executed in a distributed fashion. However, we aren't limited to basic data types. \n",
+ "\n",
+ "We can use Dask with objects such as Pandas DataFrames, NumPy arrays, and cuDF DataFrames to compose more complex workflows. With larger amounts of data and embarrassingly parallel algorithms, Dask allows us to scale ETL and Machine Learning workflows to Gigabytes or Terabytes of data. In the example below, we show how we can process 100 million rows by combining cuDF with Dask.\n",
+ "\n",
+ "Before we start working with cuDF DataFrames with Dask, we need to set up a Local CUDA Cluster and Client to work with our GPUs. This is very similar to how we set up a Local Cluster and Client in vanilla Dask."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import dask; print('Dask Version:', dask.__version__)\n",
+ "from dask.distributed import Client\n",
+ "# import dask_cuda; print('Dask CUDA Version:', dask_cuda.__version__)\n",
+ "from dask_cuda import LocalCUDACluster\n",
+ "\n",
+ "\n",
+ "# create a local CUDA cluster\n",
+ "cluster = LocalCUDACluster()\n",
+ "client = Client(cluster)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's inspect our `client` object:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(x, y_noisy, label='empirical data points')\n",
+ "plt.plot(x, y, color='black', label='true relationship')\n",
+ "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. cuML makes it easy to build machine learning models in an accelerated fashion while still using an interface nearly identical to Scikit-Learn. The example below shows how to build the same Linear Regression model on a GPU.\n",
+ "\n",
+ "First, let's convert our data from a NumPy representation to a cuDF representation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cuDF Version: 0.17.0a+382.gbd321d1e93\n",
+ " x y\n",
+ "0 0.179561 3.154744\n",
+ "1 -0.714866 0.043070\n",
+ "2 -1.555288 -5.391598\n",
+ "3 -0.554378 -4.569417\n",
+ "4 -0.280322 1.784538\n"
+ ]
+ }
+ ],
+ "source": [
+ "import cudf; print('cuDF Version:', cudf.__version__)\n",
+ "\n",
+ "\n",
+ "# create a cuDF DataFrame\n",
+ "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n",
+ "print(df.head())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cuML Version: 0.17.0a+173.g2c0aacf44\n"
+ ]
+ }
+ ],
+ "source": [
+ "import cuml; print('cuML Version:', cuml.__version__)\n",
+ "from cuml.linear_model import LinearRegression as LinearRegression_GPU\n",
+ "\n",
+ "\n",
+ "# instantiate and fit model\n",
+ "linear_regression_gpu = LinearRegression_GPU()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 372 ms, sys: 140 ms, total: 511 ms\n",
+ "Wall time: 508 ms\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/internals/api_decorators.py:410: UserWarning: Changing solver from 'eig' to 'svd' as eig solver does not support training data with 1 column currently.\n",
+ " return func(*args, **kwargs)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression(algorithm='eig', fit_intercept=True, normalize=False, handle=, verbose=4, output_type='input')"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "linear_regression_gpu.fit(df['x'], df['y'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create new data and perform inference\n",
+ "new_data_df = cudf.DataFrame({'inputs': inputs})\n",
+ "outputs_gpu = linear_regression_gpu.predict(new_data_df[['inputs']])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lastly, we can overlay the relationship predicted by our GPU accelerated Linear Regression model (green line) on our empirical data points (light blue circles), the true relationship (black line), and the relationship predicted by a model built on the CPU (red line). We see that our GPU accelerated model's estimate of the true relationship (green line) is identical to the CPU based model's estimate of the true relationship (red line)!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(x, y_noisy, label='empirical data points')\n",
+ "plt.plot(x, y, color='black', label='true relationship')\n",
+ "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
+ "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='predicted relationship (gpu)')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Ridge Regression\n",
+ "\n",
+ "Ridge extends LinearRegression by applying L2 regularization to the coefficients when predicting the response y with a linear combination of the predictors in X. It can reduce the variance of the predictors and improve the conditioning of the problem.\n",
+ "\n",
+ "Below, we instantiate and fit a Ridge Regression model to our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from cuml.linear_model import Ridge as Ridge_GPU\n",
+ "\n",
+ "\n",
+ "# instantiate and fit model\n",
+ "ridge_regression_gpu = Ridge_GPU()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 1.13 ms, sys: 4.39 ms, total: 5.52 ms\n",
+ "Wall time: 5.07 ms\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/internals/api_decorators.py:410: UserWarning: Changing solver to 'svd' as 'eig' or 'cd' solvers do not support training data with 1 column currently.\n",
+ " return func(*args, **kwargs)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "Ridge(alpha=1.0, solver='eig', fit_intercept=True, normalize=False, handle=, output_type='input', verbose=4)"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "ridge_regression_gpu.fit(df[['x']], df['y'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similar to the `LinearRegression` model we fitted earlier, we can use the `predict` method to generate predictions for new data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "outputs_gpu = ridge_regression_gpu.predict(new_data_df[['inputs']])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lastly, we can visualize our `Ridge` model's estimated relationship and overlay it on the empirical data points."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(x, y_noisy, label='empirical data points')\n",
+ "plt.plot(x, y, color='black', label='true relationship')\n",
+ "plt.plot(inputs, outputs, color='red', label='linear regression (cpu)')\n",
+ "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='ridge regression (gpu)')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## K Nearest Neighbors\n",
+ "\n",
+ "NearestNeighbors is an unsupervised algorithm: to find the “closest” datapoint(s) to new, unseen data, it calculates a suitable “distance” between the new point and every existing point, and returns the top K datapoints with the smallest distance to it.\n",
+ "\n",
+ "We'll generate some fake data using the `make_moons` function from the `sklearn.datasets` module. This function generates data points from two equations, each describing a half circle with a unique center. Since each data point is generated by one of these two equations, the cluster each data point belongs to is clear. The ideal classification algorithm will identify two clusters and associate each data point with the equation that generated it. \n",
+ "\n",
+ "These data points are generated using a non-linear relationship, so a linear regression approach won't adequately solve the problem. Instead, we can use a distance-based algorithm, K Nearest Neighbors, to classify each data point.\n",
+ "\n",
+ "First, let's generate our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(1000, 2)\n"
+ ]
+ }
+ ],
+ "source": [
+ "from sklearn.datasets import make_moons\n",
+ "\n",
+ "\n",
+ "X, y = make_moons(n_samples=int(1e3), noise=0.05, random_state=0)\n",
+ "print(X.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's visualize our data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEYCAYAAAAJeGK1AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAACGfElEQVR4nO2deVhUZfvHPw87zDAwgIA7WppmaamZpuWWqZW2l5WZRaG5oWBq1s/Kt0x9FXdT08ryrbfedrXct7SsbLFyt8QlRBSGVZDt+f0BZ5wZzgwguz6f65oLmDnnzDnMOXOf536+9/cWUkoUCoVCoahtuNX0DigUCoVCoYcKUAqFQqGolagApVAoFIpaiQpQCoVCoaiVqAClUCgUilqJR03vwKUQEhIiIyIiano3FAqFQlEJ/Pzzz+eklPUcn6+TASoiIoI9e/bU9G4oFAqFohIQQhzXe16l+BQKhUJRK1EBSqFQKBS1EhWgFAqFQlErqZNzUAqFQlFT5OXlcerUKXJycmp6V+ocPj4+NGrUCE9PzzItrwKUQqFQlINTp07h7+9PREQEQoia3p06g5SS5ORkTp06RbNmzcq0jkrxKRQKRTnIyckhODhYBadyIoQgODi4XCNPFaAUdY7MzEwOHz5MZmZmTe+K4gpFBadLo7z/NxWgFHWG/Px8YmLH07BRY3r37U/DRo2JiR1Pfn5+Te+aQqGoAlSAUtQZJkycxPYf9xC3eivz1u0kbvVWtv+4hwkTJ9X0rikUNc4rr7zCrFmzyr1eamoqixcvrvD7L1y4kKuvvhohBOfOnavw9kAFKEUdITMzkxUrVjBi2lzMoWEAmEPDGDFtLm+//bZK9ylqNbU5LX0pAUpKSWFhod1zXbt2ZdOmTTRt2rTS9k0FKEWdICEhAZM5yBqcNMyhYZgCzSQkJNTQnikUzqmqtPR7771H27ZtadeuHU888USJ13v06GG1gzt37hyad+m+ffvo1KkTN9xwA23btuXIkSNMmjSJv/76ixtuuIHnn38egH//+9/cdNNNtG3blpdffhmA+Ph4WrduzYgRI2jfvj0nT560e88bb7yRyvZIVTJzRa0hMzOThIQEGjRogNFotHutQYMGpFtSsCSdsQtSlqQzpKdaaNCgQXXvrkJRKrZpaXNoGJakMyyePJYJEycRN7v86TgoCjKvv/46u3btIiQkhJSUlDKvu2TJEqKjo3n88cfJzc2loKCA6dOn8+eff/Lbb78BsGHDBo4cOcKPP/6IlJKBAweyY8cOmjRpwqFDh3jnnXcqJSVYFtQISlHjlOUu02g0EhkZyeLJY7EknQGwXuxPP/10iYCmUNQ0VZWW3rJlCw8++CAhISEABAUFlXndLl26MG3aNGbMmMHx48fx9fUtscyGDRvYsGEDN954I+3bt+fgwYMcOXIEgKZNm9K5c+dL2u9LQQUoRbXhLA9fVvHDzBnT6d6pI7EDezG2XzdiB/aie6eOzJwxvczvpVBUF1WVlpZSlirX9vDwsM4R2dYdPfbYY3z11Vf4+vrSt29ftmzZorv9F154gd9++43ffvuNo0ePEhkZCYDBYLikfb5UVIBSVDmuRkjlucv08PAgbvYsTp08wab133Dq5AmmvvoKf//9t3U5JUVX1BZs09K2VDQt3bt3bz7++GOSk5MBdFN8ERER/PzzzwB88skn1uf//vtvmjdvzpgxYxg4cCC///47/v7+ZGRkWJfp27ev3bX3zz//kJSUdEn7WlFUgFJUOa5GSJdyl2k0GmnevDlTXn6lRCAa//yEcknRK2ukpUZsCkeqKi3dpk0bXnzxRbp37067du2IiYkpscz48eN58803ueWWW+wk3x999BHXXXcdN9xwAwcPHmTIkCEEBwfTtWtXrrvuOp5//nnuuOMOHnvsMbp06cL111/Pgw8+aBfAnDF//nwaNWrEqVOnaNu2Lc8888wlHZ8dUso69+jQoYNU1A0yMjKkKSBQ
Lt/xq/z0YIL1sXzHrzIg0CxPnz7t8vWMjAzd7Y6LiZXtu3W3rrd8x6/yxq63ST+jf5m2lZeXJ8fFxEpTQKBsFNFcmgIC5biYWJmXlyczMjLkoUOHnL63La62o7g82b9/f5mX1c6PgECzbBzRXAYEmq/480Pv/wfskTrf9WoEpahSShshpaenl3qXmZiYyPr160lMTAScTz6PfGMe+fl5+BiMuu9lOxqbMHESW77/gQmL32X6p+uJW72VbT/8RJeu3exGZaPHRLN//36nIyNno8NxMbFqRKXQTUvHzZ6Fh4cSUJcF9V9SVCllkYfPnDGdCRMnETuwV1HQSrXw9NNPM/XVV7jp5s7s3fsbRlMgmemptGt3AyveWuY06PkZTcQf3EfrDp103wuKChOXLlsGQjB/4hjSLcncNuAB6jVqyokjB+0kwbPGRrHqww8pzMsjMjKSmTOmW79ctECpLa/tw4hpcxnRpzOfffkVmWmpJdZTXHkYjUZatmxZ07tR51AjKEWVUpY8vLO7zJ69byerEN7cuJvl3/7Kmxt3k1UIT0U+43Ty+UJ2Fv9bOMtlzj8mNpbGLVpx6133kpWeRmBQCNu++JjtX33KuFmL7YLN+LnLyM/LY9pHa0vMZbkaHZpDw3lh6fvKjkmhqAAqQCkqFT2xgDN5+JT/e8m6rGORbmJiInv3/kZs3BK7gBEbt4Q///idZs2bM2tsVIlAFBUVRa8uNxMzsCdj+t5CzMCedu+VmJjIJ59+SsQ1rUk8Ec+8NdtYtPF7Fm/4nmat27B21Qq74zGHhmEMNCNlYQlloSuVVlZ6GubQcGXHpFBUABWgFKViG3ScqdVcybsdR0jxx/4GoGlEM3rd0Y+w+g0IDQund99+1G/YkNFjovnll18wmgJ1RycGUwAZF3Jp2rI1Ywf0ZESfLozsewtuuTlMnzYNAFkoyc/Pp7CgkG937qRpRAS9+/anxTXX4Obhwc6vv2T09HklRkubP/mQ7Kws6/tZks6QYUmxBhvbuSxno8MFk6Lp9cAgfItrRpQdk0JxaaikuMIp+fn5TJg4iRUrVmAyB5GcdAbh5oY5OISMVIvd3IomFpj20VqkLEQIN95+bbKupcuEiZPY9fMvTPtoLV//523++uM3Rk+fT/2IZtZ5n02bNpGZnqo7d5WZnsZLy/5DROs2PPH8/2FJSiQ3J4eXhzxA9Nhx7Nl3gDlrtmEODWPZK5M4cfQQcau34WMwEn9wH8v/NZmM1FTd4Oft42udwyral2EUFhby8aLZ3DU4kvRUCyaTicOHD5eYP/MPDCTpdAI973uEwTGT7fZZ2TEpFJeAnrSvtj+UzLx60JNyt+1yq7zziUj52n++kO26dJPjYmJlRkaG9DcFyL6DhkiDKUCGN20mDcV/mwICpcViKZZiB8h69RtITy9vaQoKlj5+Bunp5S3DGzeVBlOAHPj0cPnxnyfk8h2/Sm8fX+nj5ydbd7jZ7v1bd7xZGgICrBLyj/88IQc+PVwaTAHSXC9Menp7y76DhsiP/zwhV/18RBpMAXLplp+sy4Q3bSZ9jf7S08tbV47u6e0tffz8ZFBouHWflm75Sbbtcqus3yRCdux0s0tp+shRo0v8z9p36y7HxcTW8KepqCzKIzOvTl5++WX573//u9zrWSwWuWjRogq//2OPPSZbtmwp27RpI5966imZm5uru1y1y8yFEG8LIZKEEH86eV0IIeYLIY4KIX4XQrS3ea2fEOJQ8WtqJrmW4EzKPWbGfDZ+vIr5k6I58ufvLHvrLX755ReEuzuntTmd9buYt2YbiSficff0JCY2tliKvY0lW/fw5qbdNLqqJQHBIby5aTeLNn7PvDXbOH5wP6vipmEODSOwXiggOHn0MMN7d+Lpbm15rk9nvArzKMzL53T8MQBWxU3j+MH9zFuzzSqkSDwRz6q4aViSEvE3B7F21QrrMovW72LB1zsIrBdaYg5rdsxwrr22De7CjZFvzGHp1j08OWEK
IQ0aMmbGfCznzlLg6U3c6q1M/3Q9Exa/WyRVnzgJo9FIgwYNGPHccLrc0NalHZMq6lXUJiqr3cbjjz/OwYMH+eOPP8jOzmb58uUV3zm9qFXeB3Ab0B7408nrdwLfAALoDPxQ/Lw78BfQHPAC9gLXlvZ+agRV9Rw6dEg2imhuN8LQHmGNm8oF33wrl275SYY2bCy9fXylp7eTEYmXt/Q3mXRfM5gC5Kqfj9g9ZwwIlAvX7ZIGU4A0BQVLX6O/DAwJlT4Go6zXoJH0NRhlSFh96enlLbvf86A0mAKs21718xG54Jtv5cJ1u6QxIFAu//Y36edvkn7+Jd9/6ZafikZw3t7SXC9Uenp5yQaNGkujySTN9cKkwRQg73wiUs7439fy359tkMu//U16efuUGI35+Zukn9FfPvb449I/IMA6sho1eozct29fmYuDFXWH8oygzP7+EijxMPv7V2gfVq5cKa+//nrZtm1bOXjwYCml/Qiqe/fu8qeffpJSSnn27FnZtGlTKaWUf/75p7zppptku3bt5PXXXy8PHz4sH3nkEenj4yPbtWsnx48fL6WUcubMmbJjx47y+uuvl1OmTJFSSnns2DHZqlUr+dxzz8kbbrhBxsfHO92/uLg4OXnyZN3XyjOCqpQ5KCnlDiFEhItF7gHeK96R3UKIQCFEfSACOCql/BtACPHf4mX3V8Z+KS4dV/VLmkLt40WzCW8SwZiZC1g4eZzunI6vwYCbu4fua/7mICxJifg2u8r6nMEUwMIXoglv3BRPb2/Gz11mrUmaP3EMh3//lezzWQSHhfPdN19hMAViCgpm5cypbP7kQ/zNQWRYUnD38CQzzUL7W3uxf8/uEu8f0qAhQSEh3HrLLaxe/RV+/iZMoeFM/eAruxqozZ98iK/BSFZGOgZ/k91ozHa59Zu3MGf1xecWTx6L54q37ebfqqL1gqJ2Y8nIQOo8L8pgHeSM2t5uIy8vj/fff5958+Zd8jFqVJeKryFg293qVPFzzp4vgRAiSgixRwix5+zZs1W2o1cKpSnzSlOoAWz+5EPGzJhPRKs2ZDiRW+ecP8/5zHQsSWfIzsoi4dhfZGdl2anjbJe3JCUS1iSCk38dsQYnuJhelAUFzPjfNyza+D1zvtrK+cx0Vrz2kl0Kb96abTSIaMbLTz7E9xvXWsUWjvuWaknh8Ml/mPnpBvJycxkc+6LVhUJT9Xl4erJw/S7mfLmZrIw0Nv3vA131X3ZmJpZzZ8nOytKVlquOwIrKora32xgxYgS33XYbt956azmOSp/qClB63vDSxfMln5RymZSyo5SyY7169Sp1564kbOXg9hLvks7fM2dMp8sNbRl7V3eGdW/Pc306E94kgsExk63zO9ooqfeDj7JgUnSJYHbrgPsBeGnwfUT16MDrw4cQ1aMDLz52L/6BZlLPJpFw7C9Oxx9jVnQUPn5Gftq8HpPZrDvqCqwXipRFue/6Ec3o0m8AW7/4WDdoXMg+j6enF0Gh4cyfOMZu32aOjqQwv4Dh/5rN58sXkp+by8LJ4xjWsyMrZ06lID/fbpRXP6IZN/fuj4+vn91+FeTn89W7S8nPy2NW9LPW9U1BwXbS8oSEBIwBgWRnZdrJ2JUEXVFepKy97TZeffVVzp49S1xcXHkPS5fqClCngMY2fzcCElw8r6gibNNMHXr35errb2DBup0lnL81ifn7779PQFAwaakWGjVvwenjx0hPScYcGk56SrL1S39wzGSatrqW6Lt7ENntBqLv7kHTVtcyYOizeHp7ExxWn/lrt7No/S7mr91OcHh98vPyGHdPL6YMeZBxA3sRf2g/E99cyfxvvuV8ZqbLAtiC/HxWzpzKDxu/xs9ocmp75OHlRcded3D0j98YcUcXIrvdwMg7buHYgX34Gv1Zu2oFZxP+YfHG762jL02sYUk6Q2aqxTrKi3rlDTLSLHb7tSpuGvEH9rF44/csthF7rHjtJdIsKTRo0ID8/HwWLFzE2TOneT1q
sF0QtCSdIS01RUnQFWWmtrbbWL58OevXr+fDDz/Eza1yQkt1BaivgCHFar7OQJqU8jTwE9BCCNFMCOEFDCpeVuFARZRf2rqJiYksX76c+58bB0JYU3R6KadxMbHWQDZ//S7e3Lgbf3MQOefPM3ZAT8bffwdSSqsSzt3Dg4FDh9H46pbceGtPqwKuIC+f3Owcxs22txBqfHVLQuo3uGhjtGk3zdtczw8b1oCUmMxBzB43zG7UM2tsFF37D8TXYLCq92Z9uoG83AtObI/Oc+F8Fv0GPcmy7b8w7YMvadv1Nq5ueyOzPt/A+cx0Nv3vA8Y4jL5GT5/H5k8+ZM74kXYFt7k5OXh4eFpHY9lZWUXr2/wPfQxGHhoZw9bPPybnQg49e9/O6Ohodv78G29utFcsrnjtJWaNjSI3N5eXpkwp0bNKqf0UetTWdhvDhw/nzJkzdOnShRtuuIGpU6dW+FhFkW6hghsR4kOgBxACnAFeBjwBpJRLRNF4dCHQDzgPPCWl3FO87p3AXIoUfW9LKV8v7f06duwo9+zZU+H9rgs4FsumW1LKbD5qu65/oBlL8lny8vIIDqtPWvI5DCYTS7eW/D8O696BnOws5q7ZXkIgMXZAT+au2UZ2ZgafLVvInz9+x/mMdKs4wWAKoN0ttzF86kwsSWeYPvIpUs4k8taOX6zbyc7KYljPjlahge32R/TpAgK6D3wQHz8DWz79EG9fP7IzMzAHBeFfL4zRb8xj4sN3WtdfOXMqxw/ut6b5tALb44cP4OHhQaPmVxE79y18DEae7d6eF5e+z3frVvPT5vUU5Ofb7ZtGZLd2eHh68fp/viCkQUMsSWeIi32O5m3a4ubmxpZP/4u3rx8F+Xms2LmXgvx8VsVNswo1khMTuLl3P86cOkn8wX28uWl3iWN9rk9nzKHhtLz+BpLPnKZf91uJmz2rQp+5ouo5cOAArVu3LtOyQSYTFp0vd7O/Pynp6ZW9a3UCvf+fEOJnKWVHx2UrS8X3aCmvS2Ckk9e+Br6ujP24HKmI8st23a/eXUr8gX3Wu/3T8ceIufd2XZVeWmoKgUHBTm2GsjMzEMKNXd98xeIN3+FjMGJJSsQcGk5OVibP9enMj5vXcSE7m573PcKOrz6xex9LUiL+Afo2RsGhoXTr0oUNG9ZiCjTj7uZG39t7g4DVX60mMzOTsQN6YAy8OEc1OGYyq+KmMXZATzy8vMjNyeH2hx5j3KxFzBk/Al83iBnQE4QgPy+XhZOiOXfmNI2at+Cfv4/q/g/OF6uvRt95K35GE+cz0un76JM8Mf5F3D08eHhkLInH/+aFRwdiSTrDV+8uLaHumzdhNPUaNORswin9Yw2rz6g35jI1chDuHp4sPXyAKf/3ElP/9ZruZz52XAxjRo+y+hUqaj9XahCqLJQXXy2mIsov23V9DMYS6bz6Ec249e77ShSrLpgUzW13309qcrJu2iwlKZEXH7+PcQN74mf0t4okGjS7Cl+DAR+DER8/P+5/ZhSvrfqMx2NeoPu9DzFvwmjr9oRww3I2SXf7yWeTWLduHU888QRrvvycRx8dxOeff876TZspBHrc/wjTPlzN+Yx06/ruHh48OWEK0z9ay4Xz55n/9Q5rge34ucs4euQIza+6igbNr7am2XrfPwhfo5Fe9z9SQtwxf+IYAkLq4SYEExe9w/h5b+Hh5cU9Tw/HvXgE42swEBgSiru7e5E1k466L3rmAn7auoHzGemcS/inxLFmpqUS0aoNQWH1eXHp+zRp0ZrRY6JLfOamoGDqNWrKsreW0btvP9XGXnHFoAJULeZS2qHrrWuruLOVeg8Y+iwn/zrM2AE9Gdm3K2MH9KRpq2sZ9uoMvH19S8wBzZ84hsCQUG7t2oW1a9ZQkHdx7kcTLUT16IC7uwer5rzB1Gce45nbbmTT/z7g2IE/ea5PZ4b3vInJg+7i+rZtWTBxNAd+/tEqO58/cQz9HhvKnDXb2PXLb9x59wC2//gzC9fv
4q3tvzB/7XbiD+5n59df0PPeh0sE14WTx9LnkcGY64Vaj9PHYER4uLN//36rbD07K4udX39JbNwSIl96jaatrrWazo7o05lGV7fktfc/Q0rJJ4vnEN64Cbc/9FgJJWBc7HM0bN6C5MTTeHn76H5OQaHhNLq6JS89cb+uXD8nK5PMVAsRrdowfu5SvvzyS4wOo8tVcdNIPBHPmxt3M2/dLtXCoxZQGVMjVyLl/b+phHYtpizN/kpb93T8MS5knyc9+RzLXpnEzq+/tM4Xdb7jLi6cz2bu6q1IWYg5NBxfgwFL0hkKCwr4a98fRN/dA39zEJmpFno9MIiRr8cxuv+tbNu2jauaX8WiF6IZ+cY8awpx/trt1rTUnNgRJJ85zWurPremFRe/OJbuN3fCy9uLZUuXMSs6iqyMNDw8vej94KMMjpmMu4cHQeEN2ffzj7z6ny/tRiVRU95gwoP9mLtmK+Pv70f0Xd3xDwomw5JCYUEBY2bMtyvaTU9JJj8vF3NIqH2K0SbwPzlhCg+PjMWSlMi/ogbT/7GhhDRoiDk0jPiD+4gZ2BNTgJmUlHM816czpsAg0lNT8DUYyc7KZM6XW5j4UH+nRc3/99YHxNx7O8/16Yx/gJkLOdn0fvBR7hocaed87mswEBAUhOXcOeu2srOy2PzJh3bzddooOnZgL6a++opK91UzPj4+JCcnExwcXKrcW3ERKSXJycn4+PiUeR0VoGoxtsWyWspHrwGfHj4+PrS85hpi770dc71QCqXkxNFDdnMks2OGFxWiTh7H+LlLrcFpwaRouvS7m5+3b2bhul3W+SVNzWYwBZB9PpOEs+fIPHKEUf26UlhQyOKN39t9iY6bvZjou3tYi1/rRzQjZs4you/qTsu2RfJ2q0vEpGjc3Nxw9/AgOyuL79atJqhemHV7tiIEb18/ou/uiQDmrN5Kbk621dnilaEPE94kooTTw/HDB61f+ubQcGthsbZ9X4OBHIOR88Uydi24BAXXY/UXn2EwGGjQoAG39ehJ2oVcXl35CVIW8nrUYOpHNKN38QhLS6PajpLqRzTDz+jPyGlxfL9+Dbs3fM2Orz5j8ycfWoMyFLf2SE1l6NCh1s88OyvT6Xydf2Agu3btomvXrtZzoaiP1l7atWtHeHg4isqnUaNGnDp1CmUYUH58fHxo1KhRmZevFBVfdXMlqvjefvttu3bopSm6YmLHs/3HPYyYNhcvHx+e7d6BRet3OVXNeXp5242Uej/wKOPvu0NXfTZ2QE+mf7SWZa9O4sgfv9Go2VVYziaxdFvJz2TkHbfw4tL3aVBsZ5SdlcXTt1xvF8xst7tky09FI5lnHiMrI92lUi8u9jmuvv4GnpwwBYBTfx1h/P138ObGkvs8ql9Xrr7+Rsb+e+HFNhxHDhE7Z4mdlVJE6zYMHDqMOc+Pwhxcj993beOfUycxGo1kZmbSsFEj4lZvs45uNDWiKSiYd2e8ysaPVxEUGk5Wehq9HhjE4JjJpKckM6JPFxZv2m1NP7497f9IOnXSuj/ajYdmLKt95kZTAElnTuse04g+nQkJq09mWipDhz7Jru++5/ff92I0BZKZnkq7djfw7fZt5bpjVShqAmcqPjUHVctx1g5dLzjZ1jtpE+2moGCmDHkQX4NBf44kLJxOvfrR6KoWjJo2h7lrttGpdz8WTR6Hu4cHs8YO0507qR/RjDEz5lOQn8f9w6NJs+iLKlLPJdnZGcUf3Iefv35hrcEUgCUpESHcSEs+R7c772HBpGhOxx9j8ycflhAhxMx+09pg0JJ0hqVTxhMYFOK0aDcwuJ51rmnrFx+TnnSamIE9ie7XjZF33MLh337m+/VrGXPnbRz9/Vf2/7ybvLw8prz8Cvn5+cXzesF2o67eDz7K/EnRpKckE/niv7jt7vsxBAQw/aO1PDlhCukpycwaG4Wnjzdj7ryNlTOn4uXtzfBXZ9Kw2VWM6NOF0Xd0Yexd3elyQ1vrjYf2mW/ZuJ5hUcNY
9IK9kGPW2CjcPTzpeHs//v35Rjbu2s3J04kX68o27iarEG7t3uMSzjqFonagAlQdwWg00rJlS920nmM326tbtMTbr8iS590Zr+Ll7U3uBf1i1uQzp/E3B9Hoqpb865lHGXF7Z2ZFP0v8wf106Xd3USC68zYib73BKqLQUlLaF79/YBBubu4lRATzJ44BIPVskvW5/y2cxYXzWbr7Ykk6w/RhT/DioLu5vm1bkk7GE94kgucf6Iu7u4c1VahhDg3D08ub8QN7EjuwFz27dObCeX0HivOZGfy6cyu+Rn9SEk8z9MmhxP/9F/+cPMnm9d+QeDqByMhIslItNGnZmkXFwowF63ZaBQl6Ld4Hx0wmICi4uKtvV3Zv/BoPd08mPnwnUT06MvKOW2jasjXv7PqjSORxYB+r4qaRnpJM0sl4rrv+etJTLQQGB/P+++9bXTy0z7x58+a4u7tzaO+vjOjThWdva0/03T1oeUMH5q7eyvGD+1m7agWxcUvISk+z8xKMjVvC73v3kpiYWL6TTaGoJagAdRlgW+8Ut2YbnfveTWpyMqfjj7Hti4+ZsGAFtz/0WEk59aRoet33CAnHjnLyyCGatb6OxRu/Z8XOvSze+D2n4//ml+2bkUBWerp1VKBJrYvMYLO4kH0eHz8DEa3b2CkCI1q3wdvXQOx9txNd3BupV5ebiRoWVcKEdvHksURFRbFl43pOnTzB97t2cttNHdi59gsK8vNxc3cjqkcHq0WQtl5hfi5fr/6KUydPsHDBfCIjnymx7fkTx3DHoCeY8fHX1AsPJ2rYMJYueRMPDw9r4A8MDGT6G9Nw93Bn/NylurJ+oISBbnpKMpazSQg3wahpRT2k3vhoNfPWbicrI43ZX2wi6pXpuHt4WA1v133wLiP6dCEx/m8KvXyYu2a7VZ237Yef7NR5EyZO4ts9vzDzk3V4eHkRM2eJXZ+q0dPnseXT/+JjMFp9AzW0UenevXur8OxTKKoOJZKog2RmZpKQkGBV8S1fvpyJb67Ex2BkVdw0zp3+h173P8LCF6LxDyxSq9kWsxoDzaScOc2td99H5EuvFc2FDOzJK+98XMLNe+yAnvz7k3W8PmwwCyePtWt/MSd2BLcNuJ+vViwiN/s8A4cOY+BTwzl+cB9NW7UBKdn22X+J//tv0tPTrQWm2rxa7MBeLufV8vPzadKiJSOnzbO2g58/cQyr4qYxcOgwFk8eS+TTkdx4443WdexbsJtJPnsGNzd3UhJOsf2zj6zvo0dCQgIBZv0CZU3Wb7t9L18/0lNT6HHPQzRrfR0fL5zNmBnz8TUYSDwRj5/Rn/oRzUpsy1wvjKETXyEuZjj/+nC1XTAc+cY8RvfrxqODHqFx48asWLGCuNVbyc7KxGQOonWHTiW2Zww0E39wn647fFZ6Gu3atSvvKaZQ1AqUSKIOoX2xL1++HD9/f7LS02nevDkHDx4kKLw+GSnJFErJ3K+2WNN7Gz9eZTfBnp2VRfzBfbzx3JMs3boHL29vlr4ykR2rPyM4vAEZlhQ7uffIvl15ccl7eHn7MOau2ygsLCTAHEy6JRmDv4mCvDwiIyPJzsnmy7XfkJaSjCkomPSUZAKDgnnovnuZO0ff2dg20NqmLvPz8xkXE8vSZUsJDg0nIy3Vuk/pKcmMuKMLPt4+Lu1/HIO43vvordOwUWOrg4OGJekMMQN78s/Jk9b1MzMzOXHiBIvfXMKqVavwDwjEkpJMQUE+AeYgzmdkkJefx8J1JYUpYwf05OV3PmbasCdY/u2vJfYjsls7hBDkXcjBw8ubNzf9wPuzXy/xWWrbi767B42uaoHlbBLTPvjSTqVpcIOfftjt9JgVitqAEknUQRzNQsc/P4GPPv2MAikRHp4UFBaS6+510Yl77XaatmzN2lUrcPfwIPLFf9Hn4cHMs5kbysnK5MN5M7m5T3+gqAj07D+nihwWXLh5hzRoiJ/RxMMPPcSu7Vs5Hh/P7l07OV7cev29994nIDjE
zrE8KCxcv6FKMc7m1SZMnKRrrqq1g68XFs62rVucikUct+1q/s5xHb0eWLPGDSM/v8AqltCWvfbaa1m4YD6nTp5g/ddreGroUDw9PYv2SUCrVq35d/SzuiITNyGc9qnKy81lwbpdzF2znfAmEbz0+L3889cRmlx9DbNjhpcQSxTk59P3tq40rh/OiD5deObWGxnRpwsGN/h2+zaXx6xQ1GbUCKoWomcWOmTIEN5+5x2uvv4GxsyYj4/BSFSPDtbCWA3tjnre2u1kZ2ZgCgrhgzlvsOXzjzCaAsnOyqCwoBBzaBiZqRYKCwuZu3orIQ0althGROvruKrN9Tw5YQqWpDOM7t+NhFOnCAwMtC4bEzueLd//wJE/f9fdl9iBvTh18kSZi0ldjWI0efuLg+4u1zbLg/a/X7J0Cb5Gfy5kXyyqXTplPN07ddT1QLSV9WsjmIWTxnD4j70UFhTgZ/TnQk42Pe9/hNzsbLZ9+QnePj40bH51ia7BEa3bWKXzmqns9P+u4f+GPMCtd93Lrm++whhoJsOSQvvbevHbji0k/HMKo9ForYO6+uqrKSgoUL59ijqBGkHVIWxFDxf7NP1MXl6uTSFoIn5Go+58iaeXN6P6deX14UN47vab+W3Xdp5++mkaNwgnolWbi72L1m6n6TVFI66S2/AiPflccXPCMyx6IZrhw4bbBSfN7++hUeMxOTGXLW8zPlf2TgZTAItfLL1IuSJ4eHgw9dVX8HD3ICbOXpDgzAPRmWfiqOnzyc+9wNT3PuG6Lt24kJPNjq8+48SRgyze8B0rdu6lacvWjLzjFqK6d2BEny5EtG5jVUlq2zGaAkg6dQI/fxNPPP9/LNnyEy8ueY+lW/cwdtYiAoOCrf/jkJAQ1m/YSPsOHXWbUKoWHoq6hApQtQxnX3aPjHkeg039kK/Rn7QU/dqjrIw0Xl/1hTXVZq4XSmFBAfHxx0u0UR8/d5m1lsh2G3kXLpB44hjj7ryN2IG96HHzTSXEBVowcdXyvTRLJkf0pNzatixJZ+jZpbNTkUNlkZCQQEBQMK07dLK6Z4DzgOsqqAYEhfDuG1MYEvsive4fRM75TOtn4O7hQdQr05n9xSbSLSlIYODQYVaVJMC5hH84n5nBghfGkp+bS1SPDny8aDZhjZtanT9sGx5OmDiJbT/85HBzs4fxz0+wK0VQhrOKuoAKULUMZ192Ea3a2Dl4Z2dm4Gf0122z7mvwx6vYPUALQh988IHTNurePr7EH9xnt43eDz1GYHAIK5YtcVocrAWTnKxM3Zbvi16ILvdox9k80MJJ0URFRbFwwfwq74nkKkjqBVxXy1/IzqJnl87EDOzJd+tWWx3gbakf0QyDyYS3j4/VADc7K4sDP//Iy08+SPM2bZn/9Q5W7PzNrpZKE0K0bHkNRqOR1NRUlixZwsg37AuaR0yby/IVK3QDlzKcVdRmVICqJWipF5PJpPtll5OViYeXF3PHjyj2kAsn98IFwptE2NUehTeJoCA/z05uXHQnH0yqkxFXRpqFacOH2Dma3zU4krSUFNq1a+c0wNgGk7sGR1pbvj/bvT2j+3fTHXWVhZkzptO9U0diB/ZibHH9VM/ONzEnbna5t3UpOAuSzjwQXS0f+XQkCxfMZ9uWLfj6+lpdL2yxJJ0hOzOT/NxcThw5xMi+t/DULdcxK/pZks8k0uTqazAFBQPY1VJF392DJldfw+FDh8jMzGRcTAy+/iUDoI/BSF5urm7gKq1ti0JRkyiRRA2jJ4hoec01FHr6MOKNixPus8ZG4Sskt912K++88w5GUyAp584S0epaRk2bi5SFCOHGwsljadqyNVGvXAwMlqQzjBvQg8cefZQf/9hv/aLSRkvZWVl4envzaPQEIlq1IScrkwWTojlx+AC7d+2kZcuWpe6/5hWYZknh/vvvY05cnN181aXgTIZeHZTXA7G05TXxR+d+A/jn76OMm73Y7rNNPZdEu1u68/v33xLSoBHRDqazTVtd
axVOAIzo04Xx896i+bXXMbZfN7787BO69+hJgZQlxCq/7dzOgknRrNj5W4n9HtuvG5vWf+PyM1YoqhpnIgkVoGoYPfXXoheiSToRz7lzZzEGmEm3JONn8Cc7K5M2113H0jcXM2/efP746xj1I65i1zdfYTAFkHouCV9fP8KaNuP5eW/ZfQEmnoinIDeXa1q14o/ff8dgCrC2fej36JNMjXyU1HNnCawXSmaqha79B/L9N19ZjVJLoyaDSVVS3uNytXxM7Hi2/fAToY0j2Pn1lxhMAaQln8XNzZ3pH6/lhUEDAHTVkJqRrjbvpP2dk5VJ7MBebN2ymYH3P0j7XndYTXVNQcGseO0ltnz+EQJYrFND5VjfpVDUBErFVwtxJoiI/L83SEo6Q7c776Ve/QbM+XILb+34hZ73Pcy+fX9y1z338tH/PsYUVI/HY15gyZafeGnZKmZ/vonCggJ63NyRcQN6MLxnR0b2LfKCW7btZ+as2Uahlw9PPTWUBmGh5Obk8O3qzxlzV3fycnOZ+t4nvLjkPaZ/tJazp44TGRlZ5i+ustYa1TXKe1yulp85Yzo9br6J3etWExho5nxaKg89+CBBIfVwc3PDz9/kUsFoSUq0StF7PTCI1LNJxI2LYvDgwbRo0YJ0S4o11Tp2QE+e7d6e44cP8ObG3fR7/KkSXol69V0KRW1CjaBqkMOHD9O7b3/mrdtp93zCsb/4vyfuJy83lxkff42UhXzzwbv889cRRrw2m7WrVrDpfx/g4+fH+Yx07hg0xOr8MLznTWzbtIGgoCCubtGSqe9/hpePj10zQq02KTU1lbHjYtiwYQM+BiOpKecw2rhDlNbSQ3FpOLpcNGzUmGkfrWXCQ/2RUrLg6x26rTX8/E1cOH8e4eaGl7c3WZkZBAaFcOF8JpGRz5CTk8P2H35kxOtz8TEYGN3/Vuu2bPtpeXp5kXfhAr0feqzU+i6FojpQKb5aiLOi1NPxxxhz120Y/E0UFhTgbw4i+cxpet33CJ4+Ppw8fNCuL9L8SdFEtLqWgUOHMeKOLkQ+HcmokSPo2r0Hebm51g66ml1Q7N092LT+G5YsXaZTXBpNtw43sGD+vBr8z1xZaGneeo2a8tuuorIA2+LdBZOiCWnQiB/Wr+HI4UO8/Mqr7PhxD6PeKPIoPPXXEV6Pepx0Swr+AWZSU87h52cAN3fe3rW3WJiRaBXORN/VnRcWv0Oza68HLq2gWqGoTFSAqqU4zkGdS/iHV596mJSkM0S0bmPvMjApmiO//6p7hx19dw8aX92S1HNnybCkMPjxx9j+488lvujCm0Swe91qDh7YzzWtWus6Nqgvq+rFVijj5uFBTvZ5ZKEksF4oWelpdO0/kKST8dx2UwcKCgqsHoVplhTCGjUh8eRxIq651urCbnWx+P1Xbr37fnZ+/aX1JqXbnfew8+svWbp1j12NlxJLKGoSNQdVi0hMTGT9+vUkJiaWkFTH3NObgJB6uHl4lCiqHTN9HgX5eU77IsUf3M8Lb64kN/cC//nPByXWHz19Htu+/B+DBw8mPT3d6XxHed0fFBVDa1D4z6mT/PDdLo4cOsRDDz1IZmoKAQGB7F63mh4330RuXi6bdn7HrM820KlPfwoLCkhPSaYgP4+mLVvZSdFHTZ9PQX4B8Yf2M2/NNqvPYvzB/QSF2reCv5SCaoWiOlABqhrJycnhpps70yQigkefeJImERF06dqNyS9MYuuWzXy46j3c3ASPx0x2ah3kZzRZi2o1igxGL2AKDsHNzQ0fP4PTolxTYBAjnhte7mJUReWiZzmkCSwaNWrE+++9R2JCAls2rif+2N/k5eWxfPlyUs+dZdIjd3P0j71FTRV3/MKbG3eTeCKeVXHTrNsyh4bhYzDQ+4HH8DEYyUxL48P5Mzn512EyUlOsvbVO/XXEKrRQI2ZFbaNSApQQop8Q4pAQ4qgQokRpuhDieSHEb8WPP4UQBUKIoOLX4oUQfxS/dnnk7Zxwa/ce
ZBVi15Y7M1/SuGkEA+9/kL79+uPjZ3RpHZSVkcZ7M6eWcI/oduc9nE9PQwg38i7kkJGm75Sdm32eJk2alLsYVVE5OHY/dmU5pAWsqf96zeruPuuLTQghiCmuo4KLo+Mtn/6X7KwsCvLzWfbKJM5nZPDB3Ok83fV6hvXsyD9//8X8tdtZsXMv89du5/BvPzPhwX6cTUjg/fffU9ZHilpHheeghBDuwGGgD3AK+Al4VEq538nyA4BxUspexX/HAx2llOfK+p51cQ4qMTGRJhERTvv5LN26h9SzScTeezuLNnzHV+8utdaz2M5BNWx2NX/s3smZk8cJrBdKdmYGXfsP5PTxY9Rv2ozjhw9wc7vrCAwMZPOu3YyddbEgdO74EfTu2pm5cUX9mcpbjKqoOHp1b4snj3WqonMU0iQc+4vXhw9h0fpdJZYd0acLLy1bxZqVb3Hi6CFi45ZgDg3jdPwxYu7tzeIN3zs993KyMl3uh0JRlTibg6qMb6FOwFEp5d/Fb/Rf4B5AN0ABjwIfVsL71in27t2L0RSom3bTWnU3aHYVPe59iFljoxg3azFrV60g+u4eePv4kpmexh0PD+ZCTjZJ/5zAYDKRejYJH19fNn/2X/yMJo7+8RtCCL7fvIFXXp1KyplEou/uYZ0gDzAHgc39iDb3MfXVVy7LItvahlb3ZitM0SyHYgf2Yuqrr5T4/zt6M5pDw62ja8dgk5KUyIuP30t2VqbdjZCUhQTVCyv13HO1HwpFTVAZAaohcNLm71PAzXoLCiH8gH7AKJunJbBBCCGBpVLKZU7WjQKiAJo0aVIJu129tGvXztqgzvGLRWsKCBD50ms81fV6xtx1G4EhoQBc27EzZ04dZ8tn/6VJy1bWO2FL0hnmTRhNg2ZXcevd9/G/hbPo1eVmPDw8eOedd4hbvRUfg9EqMdZcB17711S7LyAtlaSoWly5nmvCFMfPwXau0Bwahq/BQO8HH2X+xDE2rVfOMGvsMNzdPSjIzyc4rL7de5iLuxKfjj+GlIV2NXG2556r/VAoaoLKCFB6PVOd5Q0HALuklCk2z3WVUiYIIUKBjUKIg1LKHSU2WBS4lkFRiq+iO13dhIeH067dDcyOGW5Nvdi6AkBRgW5uTg652TnM+WqL3ZfJ6fhjjBvYq4QyL3rmAkbc0YXvv/7SWlz7999/230R+ja7quinwaC+gGoQx2Cj4UqYos0VLnoh2uqheNfgSF4e+hCj+nUjIDiElKREbup5B+7ubhz9Y2+J9/Dy9ia8cVNi7u1NUL0wMtJS6XbnPcQf2k/9ps3x8vYudT8UipqgMkQSp4DGNn83ApxplAfhkN6TUiYU/0wCPqcoZXhZ8u32bRjcsLblfq5PZ06fiCcl8TTDenTg9ajBvPDoAHz8/Aht1JgGza6y1qpIWYghIED37tuxBbpS6NVOLlWYolkkje7fjWe7tyfm3tvpdHs/Zvzvawz+JgxGf/bt3skdt3Yl8plIDKYAO1ujFa+9hJevL4s3fM+ijd8zb802jh8+QE5WFm7ubtbWHa72w1F1qBofKqoFKWWFHhSNwv4GmgFewF6gjc5yAUAKYLB5zgD42/z+HdCvtPfs0KGDrMucPn1arl27Vj4bFSV9/AyyVfub5PIdv8pPDybI5Tt+la3a3yT7DhoiPz2YYH0sXLdLenp7W5fTHst3/CoDAs0yIyPD7j3GxcTK9t262223fbfuclxMbI0cs6KIvLw8OS4mVgYEmmXjiOYyINAsx8XEyry8vFLXtVgscuhTT0lTQKB13VGjx8h9+/ZZP/+8vDzZvkNH6ePrJ718fGRAcIj09NI/b7x8fKSv0V96+/pJf1OA7n5o+2sKCJSNIppLf1OA7NjpZmkKCJCNIppLU0BgmfdfoXAGsEfqxRe9J8v7AO6kSMn3F/Bi8XPDgeE2ywwF/uuwXvPigLYX2KetW9qjrgcoKaUcOWq0bHVDe+nnb9L98vD29ZUL1+2yCVqdpI/BWCKY
OQs6FfkiVFQ9GRkZ8tChQyVuLMq7rt529u3bJwOCQ6SvwSgDguvJwJB6dueX9ghv2ky+9p8vZKv2N8lBjz6muy3HG52+g4bI1h1vVjc+ikrFWYBSVkfVSGZmJidOnGDe/AW88+47BJiDyc/P1+3TE9mtHVlpaRgCAsg5n0Wfh5+g/+NDeXnow2SmphBSL6xMsvDLtQ3GlY5eHzFtDjInJ8dqQJtzPpMXH7+PRet3OW3hkZOVyXN9OuPh4UlQSD0yUi1ERkYy5f9eomlEM6vqMDsri2E9OzJvzTZlj6WoVJTVUQ1iW5zZ4/Y7eHflSm4b8ADPL1hO7oUc3bmivNxc3tzyI+PnvUVEq+sQbgIvH18eHzcJdzd3vvzsE6et2G25XNtgXOlMmDiJ7T/u0W3hrs11vf3aZHz8jLi5uxMX+1yJ4u5eDwzC12DAHBpGUGg4TVpcQ6ubOjPto7Vs/3EP42Ji7MQ2lqRE/JU9lqIaUSOoakCvOHPW2ChOHz/GhZwcGja/inH/Xkz9iGZWZV9E6zbWDqpnTh4n5t4+FOTnYTQFkpmWSuvWrflh9/f4+PjU8NEpqhtnLvi2IxkfHx8mTJzEsmVL8fDx5ba772fjx6vw8fMjLzfX6mzv7uFhHU1N/2gtMff2xsPTi1vvupfvvv4SBMxZvU2NoBRVihpB1RDOmhKOn7uMC9nZCCFISUxk3MBePNGpNcN730xE6zYMjpls3cbc8aNo2qIVL7/9EQvW7eTNTbvJ8/Tm1u49auioFDVJWeqptCLsvb/9xvmMdO55ejjLv/2Ntl2706RlawYOHWYNTtpoqn5EM4LC6jN5yXsknojHw8uLBx94gPkTRnHg5x8B6HbnPcyOGa7ssRTVgvKzqWJcfZn4Gf2JmbOE1h06WUdVx/b/yYXz563LnTudwPHD+3H38GTh5HHWvk5jZy5kdP9bSUxMJDw83PFtFZcx5amnuuqqq6z1d2NnLsRkDuLHjd/wXJ/OBIWGk5WeRq8HBjE4ZrK1cDeiVRtGT5/Hc7ffzC+//sqB/fuZFf0sWRnpuLm5c911bYgZ2JOAwCC7eVCForJRI6gqJjQ0lLOJCbrzTBdysolo1Qa4OKry8PIi/tB+Vrz2EgBv/WsyzVpfx/y1260tE44f3M/aVSswmALYu3dvtR+TomYpbz3Vt9u3kX7mNNF3defo77/x0lv/wcfPSGBwCNM/WsuTE6aQnpJcYl7K29ePXDcv3ty4mxU79/Lmxt00b3M9Qrjxz8mTbFr/TZnmQRWKS0XNQVUxo8dE89Fnn1O/SUQJa5qWN7S3zjNpjOzblVHT5vDq04/gazCSnZnJm5v0DWZzc3I4cTxejaCuQMpj9JuZmUn9Bg3JLyxgxsdfs/6/77Hho/cxGE2cz8rA4B9AXu4Fu3kpzblE79wb0acLx+OPqfNOUWlUpVmswgmZmZm8u/JdZn+xmbWrVjB2QE98/PzIsFhwc3dn3KxFdsvbpljqhYUzecLzvPHv2brpQW8fX5o1baq+JK5Qymr0m5+fz5joaHJzczGYApj0yN00b30dSzb9gDk0jPgD+5g3cTTnMzPt5qXiYoY7dS4xmALYvXs31157rSpfUFQpKsVXhSQkJBAYFExIg4Y8OWEKS7b8xEtvfUDvhx6jsKCAOeNH6Ep/c7IyyUpP54EHHiAj1aLfFyo9ja1bNtfEYSlqEaWVEUyYOInfDh1l8cbvWbBuJ0IIxs1ejCkomJUzpzLlyQfJzc4m7dxZRvXrytNd2xF9V3dOHz9GVnqa3bmXnZXFgZ9/JCPNwhNDniy1n5VCUVFUgKpCGjRoQHJSkvUi9zUYaHx1S+5/dhQSiSXpDKP6deXZ29oTfXcPmra6lrsGR1rnEsLDw3XnGha9EM2IESMICQmpycNT1HI0BalmMmtJSrR2al4VN43jB4vbwW/8njc37aZF2/YUFuTT+KqrGT5smFVccS7hH1bOnEpUjw7EjRuOEIKud91L3JptdvVX
CkVlowJUFSPc3OyMOy1JZ5gzfiSyUPL6B18y/+sdtLyhPVJKftj4DaP738otN7azqqJmzphO904diR3Yi7H9uhE7sBc9br5JqaYUpeKsl9Tp+GNs/uRDazPMoteKnPFzss9za6ei8+vb7dvwo5Axd93G4d9+Zv7a7SVazGv9rN5++21lHKuodNQcVBWg2QtlZWVhDg4honUbxg7oiTHQTGaqhU639+Ofv4/y1btL2fzJhxgCApGFBbTqcBOHf/6R0aNGWie6VVNBxaXirJfUgklj8A/Qb55p8DeRnHyO1NRUpr0xncOHDyMLZYk2L6Onz2PsgJ48PDJW9ZFSVBlqBFWJ2Foa9e7bn+49epJyNomBQ4exZMtPvLjkPeau2UbXO+8hOyuDY/v/pNud95CVlkpAUAjfr1uLJfkcc+fNL5HTV5ZFivKiJ0e/a3AkKWeTSD6T6NRi68+/4mncNIL/fvIpbbv1wM/fpBvMjIFmLEmJqo2LospQAaoCOPbEcfRHm7NmG4H1QpkT+xzJiQl8sWIxY+68jYWToiksKKRBRHMST8Rb5wEWb/iOiFZt+PTLrxg/YUINH53icsAxRTx2QA88PTy4dcB9zJswuoRIp/eDjzJh/nLcPTwoKCwkOfG0U7/IDEsKQrgpJwlFlaHqoC4BPSfpIUOGsHLlSuY4+JSdOXmcmHt6Iwslza69jvFzl5GdlcnUZx7jfEa6rq/ZmLu6U1iQz9kzZ9RFr6gUMjMzOXLkCN179GTOmm2YgoJZ+spEvl3zOUFh9clMtVgdJdw9PBh5xy2kJp9l3uptzB43HDd3N2uaz5J0htnjhnPy6EHchVupjvoKRWmoOqhKxHakpF2wCydF4+7lZRdssrOy+GTJPBo1b0nC8b+tF7hPlpHMVAsBxYoqW8yhYZjMQWRlpHPkyBFuvPHG6j48xWWI0WjEYDDYnXNPvTCV79evZdS0OUS0amPt3qyNjkxBIaxdtQIfPz/qN23G2AE9MZgCSD2bRHiTCJCSgwf3q1o8RZWhUnzlxJn566jp8zifkc7p+GMU5OcXyXK7t+enLes5dewonl5emIKCgSK5ec/7HyHlbJLT1EnO+axqPzbF5Y2taAKKzsPbH3qMjxfOJierKE2tpfq63/sQGZYUNv3vA8bMmE/UK9OL6viWrWL2F5s4l5hQ5MWXnl6Th6SoJIJMJoQQdg9Ph7+1R5DJVG37pQJUOXFl/hoQFMLCSWOYP2E0P25eD0JgMAXi5uaGl48vb0296FA+dOLLNL6qBbPGRtnNA8waOww/fxNeXt60aNGiWo9NcXmjiSYWvRBtJ5o4m3CKUf26MaJPF57r05mGV7XgnqeG42sw4u3raz3XfQ0GGjS7ivoRzTAGBJJmSVHCiMsES0YGEuwe+Q5/aw9LRka17ZdK8ZWCY0daV07SmemppFtSOHZoP9fc0JHX3v/MmgKcP3EM3675jHuefo76Ec1IT0nGz99E9vlMou/qjr85iIxUC/l5uQQG1+OZyEg1/6SodGbOmE7UsGGMuKOL3dzTg8OjSUs+x0uD72PH6s/Y+vnHFOTnI2Wh7rluOXuGIU88oUofFFWKGkE5wVEyrlm6+Pj46Lo7zB0/gsCQUCYsehtZKK3GsFA0uhozYz6FBYWMv/8Onr2tPSPu6ELiyeM8P/ctlm77mVFvzKVhs6twd/fgsYcfYta/Z9bk4SsuUzw8PJg/bx7eXt6MmjaHJVt+4skJUzCYAvA1GCnMy0NISZMW1xD3xSb6P/5UiW688yeOwcfPyH/+80EJuyNHZatCURHUCMoJekKIxZPHMnZcDCOeG05eXh6xA3thCjSTZkkhNy+PeWu2kZ5qcVo34ufvT/b584DAaApAADH33l40erKkYDIHMfTJJ4mbPatGjllxZWA0GnnmmWf47M05jJg2F1+DwWqhddXVV7Pvzz9JO3eWiQ/fya0D7ufYgX1E390D/0AzGZYU6jdrTtOWrYieucB6bSx6IZou
Xbtx+NAhq7I1MjJSqftqMUEmk126ThT/NAMpNbJHJVFnjg6aEMK2pbYpKJh6jZqy7K1lfLlmDekWC0888QQjnhvOq1Onsn7TZgwBgRgCAsk5n6mbFsk5n8Wi9d+BlDzXpzPT/7uGsCYRWJISMYeGk5OVSezAXsyY/oZKmSiqlJkzpjNh4iRiBvbEPyCQjLRUWra8hnwfP2uLDUvSGf495hl8DUY633EnWz/7CH9zEMcPHaDXfY9YRT/m0DCCwhty7MCfTPtoLfUjmllv6CZMnKRuuGop2ryTI0LnuZpCpfh00BNCrIqbRuKJeN7cuJt563YRt3or3/26lyeeHMrnX3yBEG5E9ejAx4tm0/vBx5kTa+9UPid2BH0efgJzvVBry4LUc0nWiWetSZxmGaNwjZ7qyJXCqLzLXynIQkl+fj6FBYXs+/NPRk23T02Pnj6frLRUTh09zIJvvuXNTbvtvPgK8vNZ8fr/sWPNZ5zPSGfiw3eycuZUTEHByqOvDiMoGr0InYfZ37/a9kONoHRwFEJkZ2Wx+ZMP7YpqzaFhhDaO4PjhA7y5cbedGKLpNdfS6KoWjLjjFvyMRvIuXKD3Q48xOKZIxae1ywgMCbV7X2UZU3ac3v05URiVd/nLHS2FrRWWH/j5R+JihpdITQfWC6VQFpaYU9W8+PLz8jh19LDdNbBgUjSr4qbx5IQpyqOvjlJbDBwqZQQlhOgnhDgkhDgqhCjhuy+E6CGESBNC/Fb8mFLWdWsCTY67YOJoDvz8I4nH/y5hrmk5m8S3az8vYaIZ9fJ01n3wDjvWfI6buzt5efk0urolA4cOI/fCBQ78/CP/jn6W0NAw/jP7tTK17FaUxNndnbrjKh29Wr6IVm3IOX++RF1e/MF9+Bn9yc7KJDvrYm2eOTQMX6M/2774WDd4bfn0v5yOP6ZuuBQVosLXsxDCHVgE9AFOAT8JIb6SUu53WPRbKeXdl7hutZKfn0+hLOTw3l+ZFR3F+Yx0hJvgXMI/1l46Gz9ahY+fwXphFuTnsypuGps/+RBjQCBZGelEXNOG0TPmMv6+OxjZ9xYKCwsx+JvIzszg2WejcHd3twotbFt2K0pHq9FwpDblz2sreilsrWh31thhjJ+7FHNoGOcS/mHBpGjOZ2TwetRgMtJSrW3h01OSsSQlYgw0O+26GxcznBYtW+Lj41Pdh6hwwEsI8mp6Jy6Byrjh7AQclVL+DSCE+C9wD1CWIFORdauMCRMn8e1Pv7Bg3U5r2mLu+BG8+tTDXN/lVhJPxPPvT9cz8eE7rWlA2wZw2jrzJo7hm/+8i6+fgaYtWzF6xkXV0+LJY+neqSOnTp5QtSSVjBAXw5TZ358U5XZgh7NavrsGR7Lp4/8wok9na++oZq2v4/X/fGGXvlvx2kscP3yA+hHNOXf6H/06qaRErut0C+cSTjEuJpYF8+fVxKEqismj5A1dEPo3dFpQcFT5aVTnNVVhs1ghxINAPynlM8V/PwHcLKUcZbNMD+BTikZJCcB4KeW+sqxrs40oIAqgSZMmHY4fP16h/XZGZmYmDRs1tlPwAdbutwUFBdZ8+8qZUzl+cD/PTnmDiQ/fqWv8+lyfzkgpWbLphxKvxQ7sxamTJ1RgKiOOF4yzEZR0/FtKhBDOl68l+fbqJCZ2PNt/3GNN89neNOXm5rJ55y5O/HWU+Wu3657T7h6eFOTn4e3jS6OrWtgZyc6fOIZGV7ck8sV/YUk6w4g+nXn22SjmzolTkvMawuX5X/x7EGDRWcZRdl4V14wzs9jKmIPSC8KOe/8L0FRK2Q5YAHxRjnWLnpRymZSyo5SyY7169S51X0vFlZWRr8GIwT/A+trgmMk0bXUtEx7sh6e3t+465nphGIz6dVFKsVc+bO1YyovZ37/GFUm1Cb1Ozd07dWTmjOnMnRNH5xvb4eXknDaaAmjQtBlvbtzN27t+p2nL1oy84xYiu91A9N09iGjdhqETX7Yubw4NZ+v3u1Vb+FqKdi1Y
cGJtVHO7VikB6hTQ2ObvRhSNkqxIKdOllJnFv38NeAohQsqybnXjaKipYUk6Q1ZGul1vnNwLF+jz0OO8tOw/ZKWl6a+TnkZu7gXd19QEcvWRkp6OlLLE40pN/2mdmk+dPMGm9d9w6uQJ4mbPwsPDAw8PD1568UWy0tN1z9vM9DRi4pZgDg3D3cODqFemM/uLTWSlp9GxZx+enDAF9+KRkiXpDJlpqYx4XUnOayvatVAbqYwA9RPQQgjRTAjhBQwCvrJdQAgRLoonBoQQnYrfN7ks61Y3el1ILUlnmDk6kk69+tHj3oeYP3EMy16ZxLCeHXk9ajCvPP0wAUHBzJ84RrcBXJ+HH2f2uGFKsVeJmHGu4hMUpSs09JyZr/T6Jw1nnZoPHTqEl68PCyZF25238yeOwcvHh8B6oSQc+8uq7Ksf0YygsHC+W7ea0/HHrMvPGjuMznfcWWQwawpg165dKkgpyo7eXWV5H8CdwGHgL+DF4ueGA8OLfx8F7AP2AruBW1ytW9qjQ4cOsirJy8uT42JiZUCgWTZq2kz6Gf2lp7e3DG8SIQ2mABnSoJFs1f4muXzHr/LTgwly+Y5fZdtbbpMt23WQxoBAGda4qfTy9pZ3PhEpP/7zhFy65Sdp8DdJU2CgbBzRXAYEmuW4mFiZl5dXpcdxuQFIqfPQex4nP+1eVzglathw6entLfsOGiKNAYEyvGkzaQwIlD3vHyS9fHyln79JhjdtJg2mADnw6eFy6ZafpDEgUAaHhUs/o7/1NT+jv3x71x+y76Ah0tPbWzaKaCZNAYHq/K9mPPWzd9LT5joo6/VVFdcOsEfqfNerjrouyMzMZEx0NL8dOsrIN+ZhDg3jdPwxYu69ncUbvisxeRx9dw/mrd1OdmYGU595jCnLP8DXYLROPk999RWl2KsAZZnotX1OD23C90oVR5QFTSjUpf9AEk/E8+yUN5CyECHceDVyEMFh4SVEEWcTTtG2y63s/PpL5q3dTuKJeD6cN5OrrmvLhfPnOXH0ELHFaUFbQYayQapeXCnzbIvZnQkmPAD/KlDxORNJqADlAj1FX8Kxv3g9ajCLNn5fYvlhPTvy8tsf4WswMqJPZ4JDw8jKSCfyaWWaWRk4vbgoaW7pSVGtlCMeFEluVYByzuHDh+ndtz9xa7axKm4aWz79L8ZAM+kpyeTlXrCqWDU0hWvENddy4ughgoLrkXz2DG5u7gQEmjmblKi7jlKxVj+ubvK0IKXhSvVa2RL0qlTxXbboKfrMoeFkpKXqix5SksnNyWHu+BF4eHpRWFiILFRfgpWFduJr+Qmz9rzOss6arekFLYU9mlAoPSWZJydMYcmWn3hxyXu0u+U2/AP0C3N9/Ax0vuF6/jlxgs3rvyEpMZHEhH94+62lhNVvoFSsdQBbIVFp6DU4rIpmhipAuUBP0edrMNDtzntKdMKdPykaLx8fJj82kJQzicz5aguLN//InDXb2P7jHiWxrQJqUv56OeMoFPI1GBDCjZ+3bSInu6QdkiXpDOcz0nlj2jQCAwOtoguj0UjXrl1Jt1iUirWWUBkWYarley3BaDQyZMgQFjoomU4fP0ZhQSFjB/RkZN+ujLijCwFBweTmXEAIN6a+/xkhDRoCRXeKytW5Ytg6kTviTM3niiu5/qmsTPm/l2jX8ipiBvRkTN9bmPhQPwKCgrj9ocdKKPtmjR2Gr5+BdJ3UjjNVrFKx1gx6mQVz8fO2alewv55sVbGq5XstID8/nwkTJ7Fy5UrcvbwY0aczfv4m8nJzrX5kuRcuEH9wH9OGD8FyNome9z3M1i8+xsdgf9HZpjOUq3P5sZ28dQw+eum90gKUmntyjnbeL1++HHcvL85nn8fdywsBZKalctfgSNauWsHYAT0xFjcwLCwsxMPNzeloSOs95eg7OeX/XuLw4cNKNFTD2IYb2xS47dyu7Y2fB9WXKlcjKCfYtiNYsXMvcV9uwcvLm0ZXtWDg0GG4e3iQk5XJqtnTkFLS
vM31RL70Gn5GE/EH99ltS6UzFHUF7by/5c57aNqyNUs2/cDyb39j7todmOuFMXfCKAYOHcaSLT8xatocIlq1wRwcwjPPPOM0yDgWBccf+xuAphHNSrSMV1Q/zuZrHVPoNTGPq1R8Ojjz4zuX8A/Rd3XH3cMD/6BgMlKS8TX606FHH6JenoYl6Qyj+3ejZdsbdY1hlaT20rBVHjnKX50pjZzd5SnzWOdo5/20j9bqekueS/iHcQN7IWUh3r4Gzmem4+npxTORkcz698wyq1Rd+QCqa6Rq0VPx6ZVpOD7v+Luj4k+jslV8KsWngzM/vpAGDQkIqcfz85bh7etnbdM+ql83Bjz5LG+/NpmoqCjchJtqo1FF2Kb0PNFP56lap0tDO++lLMTfyfkfHFKPLz/7xPpcixYtypWe03pR2d78afO0sQN7MfXVV1S6r4qoTHFDdd3kqQClg616z8dgxJKUaA1GWelphDdtjq/BABSp+nz8/Ii5tzfPPfccs2YW3Umqotyqx3aEpMJQxdHOeyHcyNBpx6Glql0FpczMTJfnvSszZjVPW7VYMjKsoiKN2h4Aavv+1QhGo5GnnnqKKU/cT2pKMqagYNJTkjGYAujS925rcAJNYptB0xbX4CbcrGkOzeNMUXHM/v66rdmrc7L2SkBT3L392mS63XkPCyZFM3r6PLs0nDPlXX5+PuNiYnl35bsEmINJsyQz9MmhzImbbZf6c9aLSs3TVg+2ogdJ2RSvcLHmsLpRIgkHMjMzOXz4MHn5eQSFhTN/7XYWrd/F/LXbCQ4L57ed20qYZwaE1KNJqzYupeTadpXUvHqxlagrs9jS0dpwfP/NVxw/tJ8RfTozvGdHYgb2tLbjcCQ/P58uXbuxadf3zF2znfnrdzF3zXY27fqeLl27WcUP2ujq8cGPM3f8CLvraO74EQx9aqjKNtQAroyXNSw2z3tW587pGfTV9kdVmMVqBrGmgEDZoEmE9PT2tprBao/lO36Vnt7e0svHRwaH15d+/ibZ64FBct7a7dIYECgbNomQhw4dcrrdRhHNlVHmJUApJpauXgek2dm6Cl3y8vLkqNFjpNFkkg2aNJUGo78cOWp0iXM2IyND/vLLL/KRQYOkt4+v7vXi5eMjH33scTly1Ojia6CZ9PH1kwHBIXaGs/WbNpNPDh0qMzIyauioL39wOP/NTgxkbZ/Xu648nK3n71+RfdM1i71iU3yOuXJNXhu3eiuWc2f595hndPPkQaHhBASHkHj8GAX5+ezf8wM/bPwGdw9PUlOSS6QobLdrmyqZMHGSUixVEo55ddvnU5y8pnDOhImT+O7Xvcxds133nM3Pz2f8hAksW7oMbz8DmWkWTOYgsrMy8cky4mswUJCfz1fvLgUJ32zYSHZmBj3ufYjIl14jPSXZ2nW3/2NDrfO7I+7owmeffc4zzzyjvCurAQtF14ijnNzCxfS5oKTXpSZLd0QvDV9RrjiZuVaIuGLFCkzmINIsydwz8B6++PILZn+xmbWrVrDpfx84NcUcO6An0z9aS+x9tzP7803Uj2hWXE0fhZ8b/PzTj9blXbWPV0aZZaesLuau5LKOeAB5dfDcr2rKcs5OefkVNu78jrGzFmMKCmbpK5PY8dWnBIcV+VT2fvBRZKHk+KH9jJkx3xrkFkyKpmmra3lywhTrtbRky0/WOd2RfbsyatocPntzjpKcVwG2Bq+287dlkZg7omfQXBHVrDKLLcZ2RDNv3U7mrN7Gnn0HkUKwdtUKjh/cz/y12+n/+FPMmzC6RAPCXg8Mon5EMwJDQpGyECgaWY2fu4yjR4/YzTGVRbGkqB6UaWzZKO2cPXLkCMtXLGfsrMWYQ8NYFTeNcwmneHPTbhZt/J55a7bx974/2PDR+9bgpK0/evo8tnz6X7KzsopaxweasSQlAsWdd1MtRLRqo6zBqgjNDNbs71+m81/PNsyMfhFvVXFFBSitBkMrEAQtuCzlQnY2m/73gVW1NDhmMo1b
XMNzfTozok8Xxg7oSdNW1zI4ZrK1lbs5NNy6bXNoGAGBQXZBx1X7eKVYqhzK6r+nKBulnbMApsAiR/PsrCw2f/Kh9ZqBoutg0Jjn8TUYdIOcFpQsSWfIsKRgDg23u/nT1lM3cFWHrRO5K8riLlHVXFEBytXdoX9AID6+ftbX3D08iHzxX/R5eDAGfxPTP1rLkxOmkJ6SzKyxUXTtP7CE3Nwx6CijzMrBlQOz7cWjqDilnbMtWrQgPbXIndySlKhb0BveJIKsjHTdIJeZakEIN+bEjqCwsJBRfW8h+q7u1ps/bTl1A6eAKyxAubo7vJCTQ1ZGWonX7nlqOIkn4pn4UH+eva09z/XpzPHDB4k/tN/uAl70QjT3339fiffUZLuxA3sxtl83Ygf2cirXVZQPxzSFM7lsTdVw1FVcnbNGo5FnIp9h7vgR5ObkkHbuLKfjjwFQkJ/PyplTGXPnbXj5+JRoSTNrbBQgGXdPL3LOZ9GkxTU0ql+fq69ra/W3VDdwdYeKtOwo83tcaSKJmNjxbPvhJ2sLd9vJ26z0dP75+4hdO+tZY6M4+ddhLmRn4x8YxCvvfET9ps1Y8dpLbPvyf5gCg8hMT0UIN8zBIWSkWoiMLNlBt7QKe4VzNJGEFmycefE5a1Nd2RO6VwrOztmcnBxu7d6dvb/txWAK4HxmOj3vfRhPHx9OHDrAmBnzMQUFW68RX4OR7KxMbuk7gFv63U1QWH3e+tdkjh86wIn4Y0x7Yzpvv/12CWswpeKrGlx5W2ponadLrMvFAl9dT79KFknUeE3TpTwqUgel1SX5GgwyKCxcGkwBcuDTw+XSLT/JFm1vlD5+ftLb11eGNW4qffwMMqxxU/ni0lXSz99krfNY9fMRueCbb+XszzdKb19feV2nLtbXlu/4Vbbv1l2Oi4m95H1U2ENx/YVWn+FYB+WqnkM6e60CNRtXChkZGfLQoUMlapPGxcTK9t26253zrdrfJL19/UrUQi1ct0u6e3rK7vc+JA2mAGvdU99BQ6Snt7f85ZdfXL6XovKxvYYcawNta5xc1RZ66NRDVUUd1BWV4oOL1v8Jp04x8M7+uAvBr1s2MOH+O/CWBfga/On94GOkJZ8jPz+P1//zBeFNmmIKCsYUFMzKmVMZ1rMjrw8fwv8NeQA3d3cGPjOyhPGlUiFVHmZ/fwQX7/SCuCiK0J4vbULX8cRXjubOyc/PJyZ2PA0bNS7RDsO50GgZUhba9ULLzsoqes7PD8uZROat2cai9buYt2YbiSfi8fLxsS6rWYOp7ELVo11Peik6LW3uqfO6lsGQFI2u8qHKr6krLkBpBAYG8s7bb3Po4AHeWrKYgwf28+2O7eRkZXL/s6N4fsEKfA1GfAxGzKHhZFhSWPHaSxw/uJ+4LzbRqXdfAHz9jMyOjmLlzKkUFFu6KBVS5aLJY2Vx+kALSLZzS866fyrKj2MpRtzqrWz/cQ8TJk5yKTTSeqFpc1HDenbkX888Rm52jq7kPDcnh/r169fEIV7RaIFES8ZpN3R5XLyu9NJ7eqnyquaKm4PScCzYTbekEBkZSX5+Pp98+RVpKcl4eXuTe+ECtz/0GLkXLrDl0/+yeMN3fPXuUo4f3G9npDl/UjQRNkWIMQN78s/Jk+qOsJLR2lE7y4NDyTy5mm8qO6UV6h48sJ9rWrXWfX1Uv25cff0NNGx2FYkn4hk9fR7ZWZm8HjWYRRu/L/Few3t2ZOe2rTRo0KAo8JlMpKenq3naKsK2UNcWx8BTlv5Q1r8r6bpS/aAccGZB5JabTXBYOP96/7OLwWfiGILDG+AfaMbHYGTzJx/aNXMzh4YxZvo8xg7oyR0PP8Gil2Jo2fIadZFVI46TvcLmeUXZKa1QNz093SpDt204OHvccG7pPwAfPwMbP15ldWHxyTKSkZaq616elpLMoMce58jhw3h4e5OVkU5gUAgXzmcSGansjiobrf7JkdpcQ1gpKT4hRD8hxCEhxFEhxCSd1x8XQvxe
/PhOCNHO5rV4IcQfQojfhBBV1ybXBmd59KdfmsbevXutVfIAPgYjD4+K5du1n5NuSSb+4D7d2g9zaBieXt5MeLAfTa6+hsOHDqk5qCrA7O+v+7yreSgPUA7mZaQsxeWaDD1mQE+G9+zIiD6d+eevw+xY/Rk713yGwd9kvT58DQZ6P/goCyZFl3Bl6XnfI+QId8KaRNC0ZWve3LibJVt/Im71NmtKUXFlU+EAJYRwBxYB/YFrgUeFENc6LHYM6C6lbAv8C1jm8HpPKeUNekO8qsDZXaKUhRhNgZhDw+zy6Asnj8NNuBEUFMx7M6eSnpKsewHn5mQz/5tviXplOgHmIDUHVcnYpihsRRKljZK03LpeekNhT3mKy/Py8/E3BxP35Rbe2b2PJZt+oEGzq8lKty/SHRwzmfAmESVcWSJfeo3YuCWcOHqIZ6e8UUJotOLtFeomr5LRBEaODixlzTRUdd2TI5UxguoEHJVS/i2lzAX+C9xju4CU8jsppZaB2Q00qoT3vWQc7xKzs7JIOPYXuTk5ZBanI1bFTeP4wf1W5dHijd9jbtCoSAghJbPGDitxR9j7occw1wtVlfBVhK1Fy6Xar6j+UKVTWnH5hImT2PL9Dwg3NyYveY/6Ec2Ai2o+N3d34mIv9ntKT0nm9PFj+BqMvLRsFUu2/MSTE6bg7uGBOTQMgynA6mupYQ4Nw9vXoG7yKhlXmYbyWob5O8lmVCaVEQQbAidt/j4F3Oxi+UjgG5u/JbBBCCGBpVJKx9EVAEKIKCAKoEmTJhXaYe0ucdEL0YQ2jmDn11/iHxCI5WwS7h4ezBk/kmMH/mT+2u32Uto5Sxk7oCdzVm/l40Wzib6rO57ePmRnZdDjnoesPn2qEr72opeHDyp+XhNgaJj9/a9IObpWijH11VdKFOpq6fEJi99l4eRxuqluHz8/AoKDGdWvGz5+BrIy0rhtwAMc/XMvvgZjCYuwrPQ0hLC/V9bmqEzqxqHasC3e1XUwr4HroTJGUHrHoivtEEL0pChATbR5uquUsj1FKcKRQojb9NaVUi6TUnaUUnasV69eRfeZmTOm4553geOHDxSNkjZ+z6IN39H0mmvJSk/Dy9vHqdllbk42I16bzdJtP+Pu6Unnvnex65uvGNbzJsYN6KGsjGoZpd0ZOr2rvMJTgnq1SVp6PKJVGzKczFVlZ2aSGP83r//nC9p2vQ1zvTDO/nOSW++6t8Rc1KyxUXh6evHW1BdKZCQM/v6kX4E3CFWFs/lbDe060TISZn//Gq8drIwR1Cmgsc3fjYAS43IhRFtgOdBfSpmsPS+lTCj+mSSE+JyilOGOStgvl+Tk5HD40CHiVm/Fx2Ak4dhfmEPDeX7eMsbc1R1AV3mUmWqxupjnZGVy4XwWz/7fGzw0fBwTH+zH+nXf0K5dO6U+qmacNS10tGypzYqluoCWHs/JyrSKH2zLLeJin0O4CTLTUpn4yF0IIVj4zU7WrlrB5k8+xMPTk+f6dMbP6M+FnGxkYSHCzZ3wJhGMHdATY6CZzFQLXfsP5O99e1WavJJwJjG3pTaWYlTGt+hPQAshRDPgH2AQ8JjtAkKIJsBnwBNSysM2zxsANyllRvHvdwBTK2GfSiUhIQH/QDNfvbuUzZ98iL85iAxLCr0ffBRvX1+u63RLiYtvdsxwuvYfCMCBn3/kw3kz6fXAILLSUpk2bDCFhYUMGjzEWlOlZLLVh1bHoQUgrSo+HxWUKhNbEcWwqbNYu2oF0Xf3wNvHl8z0NAJD6jHzk3W4ubmR9M8pFr04jpAGDXlywhQeHhnLudOn+OY/77Lls/9iDgkl9dxZmra8htPHjzH9o7VIWYgQbix6cRwPPvCASpNXElpq29W1EGQy2Y2SnNZNVWOqr1IKdYUQdwJzAXfgbSnl60KI4QBSyiVCiOXAA8Dx4lXypZQdhRDNgc+Ln/MAPpBSvl7a+1VGoW5mZiZh9Rtw9fU32HX9nD9xDIf3/oKnhwceXl5kZWQQ
EBTMhewsWrZoyf4DB8jPy8PP30RWRhoGoz+5Fy7Q/NrrrPJ0bR5KdQWtXDyF0G20Zlto6Il+M0Lb9u+6Jpc666gCX32sRe5vr8Db10Bq8jl8/PzIysygz4OP8e3aL4quj/Q0CgsKWLR+lzUTsXLmVOIP7CtxzZ3PyiDh77/wCwgkK9WCEIJ/Tp4gMDCwxPsr4+XyoxnEujJUtnDxfNeCU3VdF84Kda9YJ4nMzExCw8JZsG5niTTe6P7d+PvoUdLT0+2q26e8/EoJJ/R5z4/i6B+/smDdrhLbGTegB4cPHiQ8PFxvFxTlxNaFuazV7o7POwtgKkCVHy1QmEwm9u7dyyOPP46f0URaSjL+5iDSk89RWFhIs9ZtGD93GT4GI1E9OtiJj6DoWnmuT2er67nRFEhOVibDhw+3y0I4c39RmYrSKfO1U3y+2zq2uFquEvdPtXy3JSEhgeDQMGtn0IRjf1lbUQfXCyM9PZ2WLVsSHh5Oy5YtAVixYoU1OEGRaOKRMc/j7WfUFVR4ePlwdYuWVqNNReXhrPdTaeRTUgzhbFvqK881mogiPDycdu3akXP+PPUaNGLm/76hzU2dad6mLfPXbqflDR0Yc1d3hvfuhJe3t75LhTmYBk2b8ebG3Sz/9lcWrNtZoljXlUegomQJhafN71D6NaIVtDuqWWuSKzZANWjQgLSUZJa9MsnqTj6sZ0eWvTKJdEtKiclZZ8W9Ea3acN5J99C83AvM+GSduoiqgBTsO+lqiqNL3RaUDFzqlqLsnD59msKCAuo3bcbEh/rzy/YtHDvwJ2veW05hYSECiiTn6frXSrolmVEON3+2XQGcub+ozgEXcawT1LsZc3WFOC7viF0XgWqqIbxiA5TRaOSaVq04cfSQXRuAE0cP0fKaawA4fPiw9cR3ZgGTk5WJp5cXi14oaeXS64FB1I9opi6iSsJZmwAzShJeG/Dy9SXxRDzz1m5n4fpdTF7yHkd+/5Wft25k3trtLN3yI73uf4TZMcPtrpX5E8fgZ/S3Fvxq2HYFKM0jUBX0lo9LcZSoiXKMKzaLkZmZyeFDB4lbbW/6Ghu3hNH9u9GgYSMCgoJJt6QwZMgQnhs+jCFDhpQwyVw8eSzPREbi5ubGuAE98PDyIS/3Ar0eGMTgmMnW7WoXkZYuVJSflPR0u1y6LWVJStSexMXlR/369cnNzmHEa7PtlLHpKclIKfH0Lur9FPnSa6x47SWe69MZg3+AXRGvXlmHrSOLdoPoahmFa7QApAUbRzTrsBScl27oIYSoEnXfFTuCKrojC9a9I/M1+jPxzZXWPPemXd/TtXsPVq5ciVtuDjEDe9pZwMz690ziZs/i8MGD5GafZ/pHa61WLqAuokvBmSWR3h2VdtFpr+uNsjxxnd5QVIz09HQCg0NYu2qFnUXY/LXbadqyNe/9u6h6xN3Dg6hXpuNn9Oeqa9vS+8HHnBbx2jqylMcjUFESbcTk6Pgf5LAMXLQ90pYtyyimqkZSV2yACg0NJTkpUTcffiE7m4hWbYCigDVq2lxyc3J4ZeUnFHr58MTgJ9i0/htOnTxB3OxZVgVReHg4UVFRvP3aZHURVRBnvnt680IWSs+7a8W6zgKY3vOlVd4rLtKgQQNysjLY9L8PrLWDoPnzLWX3hq/JzsoCiq6J/Lw8LlzIxtPTi4jWbfh27Rcc3beXEX06M7znTcQM7FnCkaU0j0CFc1x58JW2TI3Oxer1ga/tjw4dOrjsb18aeXl5smOnm6WPr59s1b6TXL7jV/npwQS5fMevslX7TnLg08PlpwcT5Md/npADnx4uDaYAaa4XJv38/WXfQUOkKSBQZmRkON32uJhYGRBolo0jmsuAQLMcFxMr8/LyKrTPVxqAlDoPvecp5XXpcM2ZHZdXVApDhg6VQaHh8tODCSUeYY2bygXffCuX7/hVtrvlNjnw6eFy+Y5fpTEgUK76+Yhc9fMRGd6wkdy5c6c8dOiQ0+tLSikzMjJKXeZKxOzv
b3eeezic96VdT2W5dlxtoyLXErBH6nzXX5FzUONiYsnML2Te2u12lfBZGUWtxcfNWgRg52iuzTktmBSNu6en0/kkV0abisrhUueSbOtAbHGsoFdcGgY/g7UbgOM8UUpSIv+KGsz59DTr/Ky7hwfGQDOWpMTiGqgs2rVrV+r1osnbFfa4OocrQzquZRkcMVd4y8654gp1MzMzqd+wIXPXXCwWzM7KIv7gPl59+hGklISE1Sfy/15nbuwI5ukUFY7o05nj8fGqALcKcSWGsD1n9exYHNcrrXreA8irg9dBbUJrFd+l/0BOn4hnjI1F2KyxUZw4fIDJS1cR0aqN1c3cknSGsQN6Mv2jtbz92mTlvFJJOLMoKou4yNkytk4TZb02y4Mq1C0mISGBAAdxhK/BQOsOnQgOb8Ar73xMYL1Q5sSOwNNJUWFAULByWa4lOM5V6S4DJfLqWv7djKp3qgw0GXjkS68R0epaxg7oyci+XRk7oCdnTsTz0IMP8tmbc8jJKiq1uOhk7smLg+5Wc0mViN78bUVJsfndablHFczZXnEpvgYNGpBmSXbqVB7RqsiWJfruHlzIydFd7sL580qRV8WY/f0RTowqXa5H+VKAmj+fomJodYLpKclWY1hLUiJCuPHioLuZO2cOU//1GrEDe2EKNJOeauGJJ57gufffpUmTJnZpPeW1V/k4uy5sfSyd9oFy+Ls60+FX3AjKaDQy9MmhzBobVdS7JiuLAz//yNznR9HrgUH4GgyYQ8PwNwfRpe9dzJ84xk6RN3f8CCIjI+0auNkW9Coqh5T0dF2BTGkXh+YwoaESd9WDowzc12DA12Dk7dcmM3jwYJKSkpj66iucOnnCqoBdMH8e1157rfVays/PJyZ2PA0bNaZ33/40bNRY2YRVED15uYbEfmRk66hiiyY7rxFVq96XQG1/VFTFZ7FY5PVt20pvXz/p6e0tA0NCpY+fn+z/+NNy7pptcuG6XdIYECiXbdsjez0wSHp6FS3jazDI6HHjZF5enlWt528KkGENG0l/U4BS69UAuFAW2aqYqkp9pLiIo4LVPyBAtm13g/QPCJCNIppLU0Cgy2tkXEysbN+tu52qtn237nJcTGy1HkddB51zvCxq2Jq8JnCi4qvxYHMpj0sNUNoFZAoIlAHBIbLVjTfJ5Tt+lR//eUL2HTREevn4yuDwBtLLx1eGNGgk/fxNMrxxU+np5S1bt7lOWiwW67aix46TDZo2K1qm+GeDps1k9Nhxl7RvikujtAvNXEb5LCDN/v41fTiXBSdPnpT33X+/9DMYZasOncoUcDIyMqQpINC6rPZYvuNXGRBoVpLyclCWAOUoQdceHpS8uauOa8NZgLqiUnyaG/K0j9aSl3uB8fOWYQ4NY1XcNBJPxLN4w3cs27aHxRu+o16Dhtx6170s2vg9b27ajfT25f+mvAwUpfXeWr6ckAaNmL92u7ViPqRBI5avWKHSfdWIs8JbLW+uCSGcLWMnmlB+fhVCS9G1vKYV277dSV5+HuPnLC1h7rri7RX8+uuvdteJ8tqrPGxFDM7QkqaO80taoXsetePauGIClK0bspSFVpuj7KwsNn/yYYnq99i4Jez65issZ5PIzspk1LS5rHxvJZmZmRw5coT8vDxr0zVtnTEz5pOXl8uRI0dq8lCvKPwd8uLaBeWYW9dTNaWgqEwmTJzEth9+YsG6nUx971OCQ8N1A46bhxf9Bwy0m2NyZsasbMLKjzZ/WxqadLw2c8Wo+Gzv0HyyjGSkWopFEpn4O7lz8/HzY3S/bgSE1CPDkoKnlycnTpwAwM/fhI/BSMKxvzCHhlvFFX7GqrOeV5TE1kBWqfFqDu0GMG711ovXmJOi3bzcCyxcv4vUs0ksfnEsueNiWLhgvlVk4WjGrGzCrlyumBGU7R2ar8FA7wcfZf6kaIRwI8PJnVtGqoV/f7re2oojvEkz3lyylGbNmpFzPouoHh2sfaRWzpzKuYR/uJCdRYsWLWro
KBWKmsExRaddY44GsPMnjqHn/Y/w8aLZTHz4TlKSklj21jJGj4lm2uuvKa+9SsQD1ylwvfrA2sYV5SQREzue7T/uYcS0uZiCglnx2kts+/J/eHn70LD51Yyfu8yu+r1py9ZEvXLx4rAknSF2YC+eeOIJtv3wE2NnLbYuP3/iGM6cOsGgB+9nblxcZR6uohS0EZQzxwinVe96z9XB66E2kJiYSItrrrFzaCnIz2fFay+x5fOP8A8IJDMtjX6PD0UWSk4cPmBNq1uSzrBwUjQ9O99E3OxZqg6qknDW5r00ZxWn10sVXhvOnCSuqACVn5/PuJhYli1bSmC9MFLPJXH7Q4/T7a57+Vfko7i5ueHp7UPuhWzc3T1YsXOvtWWGxpg7biEtNcXuQoSi4DW6fzcSTp0iMDCwooeoKAfOrF3MQAb2ThEe6DtHeFA0n6U8+cpHfn4+EyZOYsWKFbh5ehLeJMLhRm8Yxw78QWF+Pgg35ny1hYkP32n1t9TQbv5OnTyhglIl4SxA6d2cac87m5eqil5Pdu+trI6KjFxHjxpJSFh9np+/HA9PL+5/dhTX3NCBPo8Mptm113Ndp1toek0bCgoKSE9JtlvfknSGNEsKgUH6faRC6oWRlJRUnYek4GJlu2O6IoViVZKNbNVZO4587CvknfWjqsr21nURTRkbt3ory7b9TNOWrRnZ9xaiurdndP9byc7KoGW7Dry5+Uf6PDKY2WOjMAYEKrVeNaCl9zyL/y5N2Qc2xboOcu+aunG7ogIUFM1FZaalYg6px+0PPcb84hz54JjJNGx2FT9uWcfJIwcpyM9j1thhJVwknnzySdItFqU2qmVUtj+Ys35USop+EVtlrDk0DHcPDyJfeo1ud91HeqoFU1AwCfF/U79pM754+022f/kJ6ZZkUs4kWl1cEo79RXZWlrp+qgDtZsz2pkyTlTteJ65avdckV4yKT0OzZFn0QjQhDZtw5PdfGXFHF/yM/mRlpHNL/4H8sPFr/vX+56z/70qi7+6BvzmIDEsKSMm367/Gy8tLqY1qGa7u8JylABUVQ692aVXcNM4lnOLNjbut10Zc7AgsSYnWtN6SKROY/Og9ZKanYgoKJj0lmcCgYIYOHaqunzLgNKVdhjScqzkm7WdtatRZKSMoIUQ/IcQhIcRRIcQkndeFEGJ+8eu/CyHal3XdqmDmjOm4513g5NFDLPh6B2/v+oPx894iODSchL+OEBgUQoNmV3FvZFG7jReXvMfSrXsICg4hKSlJdfasY9iOhhSVh2PtkrOawpjZi0m3pOBjKAo+Hp5ehNRvYFfkHhQWruoEyoir0b1tOvpSqMl0nh4VFkkIIdyBw0Af4BTwE/ColHK/zTJ3AqOBO4GbgXlSypvLsq4eFekHBRd712g1G0UphkQKCwuZ9PBdFBYW4O7hiSkomAxLCr0ffJS7Bkcy4f477CZxldqobuCo8nN1B6ndhVZFz5vLEVtlbHZWJq9HDWbRxu9LLDeyb1cmLXqHDR+9z8aPV1lHWBpKJFF2XJ2bziiTQKIGRULORBKVkeLrBByVUv5d/Eb/Be4BbIPMPcB7xZ5Lu4UQgUKI+kBEGdatdLTUhCkomJUzp7L5kw+taTx3D3eat7yOcbPftJOQv/rUwyVSeKqzZ+3GNhWiXbweFE0a6yn5zKg5pvIyc8Z0xsXEMqJPZwJCQkk9l6RbnJthSWHDR+/z9597nTpMaCIJdU1dOs5UeEGU7pxSm0ZOGpWR4msInLT5+1Txc2VZpizrVjpaamLFay9ZW7ovWr+LqSs/4UJ2Ns/9a3YJCyPLuSSm/N9LVb1rikpELxWSB06VfNqFraVHtAlkTyouurhcsVXGTln+AV37DywhLpoTOwJjQCDbvvgfo96YZ3WYsEWJJCoHveJb7dyuixnUyghQesetVwOpt0xZ1i3agBBRQog9Qog9Z8+eLecu2mM0GouKbb/4H6Onz7OOpP5vyP0YA8xMfPhOVs6cSkFxHxol
Ia+7aP1wbAONK5xJ0Gtablub0ZSxvgYjT70wleOHDxB9dw9rR93mbdpydZt2eHl7Uz+ima7DxMJJ0UpkVMVInDtL1Fa1XGUEqFNAY5u/GwGOxQzOlinLugBIKZdJKTtKKTvWq1evwjs94rnhVgXSqrhpHD+4n/lrd7D821+Zt2Ybxw/uZ1XcNKDk3Z1qUlh3sL2j1EZOzqhrF29twbZZYd6FHPo8/DgRrdowatoclmz5iXueHs7Z06fIyki3lnQ0LW4LP6JPF0b06UK3DjcokVEZqch56qoOsFai14OjPA+K/i9/A80AL2Av0MZhmbuAbyj6P3YGfizrunqPijYslPJi/5mF63ZJgylAtw+Nsfh1rYeNbT+psjRgU9QsOPTDweGn48Pl8wqX2DYrbNS0mTT4m6SvwSDr1W8ovX19Zd9BQ+SAocNk2y63Wq+1het2yWs73CRHjhpd07tfp9A7T8G+95kZ3TgkPWrpOY6TflCVYnVUrNKbC7gDb0spXxdCDC8OgEtEUVJ/IdAPOA88JaXc42zd0t6voio+jZjY8azfvoOUpCRd5dEzt95IXk42zz77LDNnTLdWzTvWP3Xv1JG42bMqvD+KykWbS3JUMLlSMinlXsWwVbZmZmZydYuWzPhkHfUjmlGQn8+quGls/uRDPL28Kci9wJNPPsmcuNl4eKhxalnRU/HZpq5LO8dt7b7MFIknavocV158OuTn5zN2XAzL3lqmK3sdN6AHhw8eJDw8vIQ03XY5JY+tnWgqPu0M19R7zvz4QAWoyuTw4cP07tufeet2Wp8ryM/n3RmvsvG/7xMSHk5WejqRkZHMnDFdBaky4ipA2Z7bLssp0Ff7aetXt+RcefHp4OHhwcIF8xkWNYyFDpO2iyeP5ZnIZwgPDwdUx8+6iOMF5qxbKOhfzIqKodeEcFXcNE4dPcybm39g4cbdxK3eyvYf9zBhYrXU6F8W6Nl6aZQ2z6rhTO2nrV9byi2u6AClMSduNj073+TSGUJ1/Kyb2E4oVwRlFFt+bMUTmvfepv99UKIT9Yhpc3n77beV6KiMaB1zpZSXfdmDClAUjaTiZs/i1MkTbFr/DadOniBu9iy7lIPjxQYoD746gK1qyRVabl5PHeWJvZWMbaBSrueusbUFGz+wJ17ePioLUYk41vpdbqgAZYPmDOEs2CgPvrqHbTrEFVrqT/vdsbjXdjnb9IdyPXeN7c3f16u/oiDvgspClEJl3PSYcd5Jty5xRYskLhXlwVe3sLU8cjZx7LRRGxfz9ZoyypVVkmYno4QV+th69yklrD7l8YF0XNZZt1xb8YT2uysla3Wfv1XpxXfFoTz46haaWCLIZEI46byrBSFH9EZeri5uhWu0co3Ygb0wBZpJT7Xw9NNPqyxEJaF3gySEsMsCwEWHFUe0NHdtmdtSIyjFFYcQQnfEVB5Zbml1VGoE5RqVhXBORUZQesvWBWd+JTNXXJHo5fM1yjO5bCvLVZQdZ7Zgpc33KsqG7RyrrakxYD3f63KaTAUoxWWNUxED+nUkepTnAleu50Xk5+cTEzueho0a07tvfxo2akxM7Hjy82ut61udQrvxsp1brXM+e2VABSjFFYuW3tBGRbbO57aBqzyhRtvWle56rtmCxa3eyrx1O1VBbjnQK8TVbnocA5NGUPXvZrWg5qAUlzWl5d81hV9pyiawD2TOlFICSkxIQ812K61ulC1Y5WGrQNWQOD8HXb3maPFVm85JNQelUDhge/FrF66g9LvRFEraJGn1Uo42SldiXZSyBas8nBXiOjubtODk1MbIxim8tgQnV6gApbhicTU/5Ygzl4m6VvhYHShbsKqntK7QlwsqQCkua1zl88uD5vCs3X3CRel5WcUWVwrKFkxRWdRlBaJCUSqu0hi2kvPSkFCiyFevuFcFqSJUQW750ZtvgovOJZ7Vvkc1jxJJKK5YXAkobNFr6ubYDNF23dpeFFmdqILc0imLFZdmP+SJvWek3rIut1NL
z0FldaRQlANb7z0NDy4GJmdNDzWTTkds14XapaCqSpQtWOloc6FlGX3rKUQdcXUO1jXq4j4rFJWCJnxwxJnwwVGGrreudaTl8Lzjc3qegApFWdDOT2fnrp0fn81rdTH9rEQSiisWLeA4BiRN+KAUeorKoDJ7hmlKvRSbv21HGbaincth9HE5HINCUSFSsC9u1NJ32sXujMsplaKoOrQUniMVHUXbZgAuV3d9dS0pFFy8IwXnF73jBe+YSrH9qVC4QisG1+YlbTs6O6I976ji0zIAl/P5plJ8CkUVYtt3x7ZeShNNqPbxVyZ6bg+OAgizw/O5l/A+Fan9qw2oEZTiisXs76/fwNDfX7cexVlKz3auynpnrLOM3ijNkpFhpwysilSQom7heA7onVOuRlx226qlsvKyogKU4oqlvEW8mkLPUWKuzVV5Ur7OvKrIV1EWynNOXW5UKMUnhAgSQmwUQhwp/llC+CSEaCyE2CqEOCCE2CeEiLZ57RUhxD9CiN+KH3dWZH8UisrCmfeeret5aSma0tC26WhO69j2A0qmAxW1D2cpWttGgpVtiaWpUCvDzqs2UtE5qEnAZillC2Bz8d+O5AOxUsrWQGdgpBDiWpvX50gpbyh+fF3B/VEoKoXqaP7mzODTmRv1leSIXhdxZj5s6/zg6EpeGWju+lD33MpLo6IB6h5gZfHvK4F7HReQUp6WUv5S/HsGcABoWMH3VSgUijqLs1FPVY626iIVDVBhUsrTUBSIgFBXCwshIoAbgR9snh4lhPhdCPG2XorQZt0oIcQeIcSes2fPVnC3FYrKR5vQdfyCCdJ53tmJbjv5rZdeVFQ/rtSVtq+VB7tRj83PXKpupFUXKTVACSE2CSH+1HncU543EkIYgU+BsVJKbez5JnAVcANwGpjtbH0p5TIpZUcpZcd69eqV560VimrBVY8eLbhoz9kq+hzVWNWRXlSUHad9wzIy7F67VEoLbU7nmCrwnnWFUm/KpJS3O3tNCHFGCFFfSnlaCFEfSHKynCdFwek/UsrPbLZ9xmaZt4A15dl5haIm0NIw5cFVcCmLx5+ibuPKdcQf140GUxz+1lxPHJ1OLsd2HBVN8X0FPFn8+5PAl44LiKKx7wrggJQyzuG1+jZ/3gf8WcH9USgqBVeNDnNtJqK1OpPS7qAd1Va2cw3Y/F5au3lF7cXVSMfWO89RQKGNsPXW1cO6LYfzMLeO1zzpUdEANR3oI4Q4AvQp/hshRAMhhKbI6wo8AfTSkZPPFEL8IYT4HegJjKvg/igUlUJKenqJLwBnyigtmJW6TezTdc6MahV1E9vPV8OM/UhH70bEWesWDyqvI3RdRTUsVCgqCccGiLYGtLaYgQyc95Mqa2HmldJTqqpx1slW+//qNbZ09dlqBd0arpoQOv5eYpk6+P18KaiGhQpFJeDqy0zPOslVoHH2mrP5ihJ2SaouqlIozW3cmSWWs8/PVsCgRsQVQwUohaIc6H2ZBaFfRHupF5etpVIezu+wFdWD3ihVk5U7G0llUiQZV4KXiqHczBWKCuLM+aGisnDb9R3tj7QvPmV/VLM4++yd2V7pudtrzytKogKUQlGD6PruUXL0peyPag9eNj2cyovt52h2eF4pOUuiUnwKRQ1yJTtV11XysBc4OCMI1y0xXH32V5JSzxVqBKVQVCF6Xmu2r2k/Xd01l9UxQDU8vDT0/O8cP6tLwUJJo9iycrmYvVYUFaAUinKgV5dSlnU0XNkhOXuv0pRgWiByZcmjcSlB7HIPfNqIqKzzSOVBryhbUXZUgFIoyoFWwAv2cwnOzF09ufR5Iu29Skv1lOfuvCxBrDLWuVJw5R5hG+iU+euloQKUQlFBHB0ENNeJPClL3J2XFQ/sR0ZQ8m78SjEMrSiljQBduceXNlLUc48Ae9GD7US/bUBTlI4SSSgUl4Cz4s2KTGw7fmm5ciBwVRulV5sjhLhiJ91LK8R1NOu1vl78vOPn7Mws2JkLiO2yti4T
TrdzhX5OeqgApVBcAlU1gV0WdZituMIT+7kSly4VV1hKzpnrR0XJlVLX/gicf26eXKxrUxZVZUcFKIWiGnFmY+TMMNTZNrQRUp7D844+cFcazoJSECXbVpQVLdVXEfK5cnz1KhM1B6VQVCO2nVRtKc/Xn9Oi3TKseynu2LXdUdt2jsmpoMNxneKfpQUegb3y0tmcn55DhCq6rTgqQCkUVYirL3fbNh6VIWkuC7aji7KmT8rTesQW28DhqSNSqCyp+qV0tbUN8q7QEz/oKTcrctOgcI4KUApFFaL35W7298eSkWH3Re2IK/lyRbBd3zalmFEcuFwp3spbD2UbOJzWf1XjvJiz2rVL+V/bqvdKC3J627nca8sqCzUHpVBUM3qqMu0L0dX8VFkEFNryzrZTmrWS475Z24s7BBJtvguqX3xxqeIHvf+fo+KxKmaJ9OYFS1MWKopQIyiFogaxnbuw/aLUCj2hKDiVtX7Gcc7E9lFed3Xty7uqU1d6aUBXIwnHeaayoifrLmuqzxE9k19n1Kb5urqGGkEpFDVIaSMaDW20ogUNx9c9KJ8SsKz7Vl4cRzdlEX/Yjmysv1/iSMLZKNSxRslWwGC7vDaStVX86Y1IHT83V8ep1HuXjhpBKRS1FL15EGcjmnwuWupUdP7KdlSn7UdpirTSVHRVie3x2o1Ci7scY/O8tpwnrkeHtv83fxtBiyZ6cbUPtVHpWFdRIyiFooZx1pVV+xJ11bJBD7u28JT9Tl+jrKM6W0qbH3N2DI5B0/Z/YSsecVXcap0Lo+RopbSCWu2n46gJLjp22L6v9rujsMVuH5zsi6L8qBGUQlHN2ErPwXVXVs3TDyrHd6+0O/3y3PHbHkNp9T6OdkLaMToGBaejGhvVoyvvvPJQnjk1R9WdK8qyL7W9tqy2oEZQCkU1Y3tH7vhl5zia0l73oHKECY53+trv2j6lpKe7/ALWXvGgpMVSWcQC2rq2OBtBOsOVd57jqOtSsRVUaNssz0jUv5T3VlZHZUONoBSKGsTxS9TVHNOl4Gy0UZHRmPkS98f2WGz3xfaYXY2OtPm1srxHRWusSnOhdzYS1YKvCkCVgwpQCkUNUt4vMmdfjI5tOLS/HQOJGf3UWlmxTYfZSq01yhoQNdGBI3oSeS0g2oognIk2bN+zvDgGGVdpS2eWVSpBV7lUKMUnhAgCPgIigHjgYSllidG6ECIeyAAKgHwpZcfyrK9QKIrQCywCyMXeMdtVGqw0yipoKK+YQkvl6ZmvuhrVled9HJdz1RqjtHXLGuQudT1F6VR0BDUJ2CylbAFsLv7bGT2llDdowekS1lcoLkscRROXiuaYXdGJdtvgZjuS0QuOzkYZetJtLdCUp+39pZqtaqM7xxGkdkfuKCWvSMpTCR2qjooGqHuAlcW/rwTureb1FYo6j61fnyucKr90tueKyvhCNXOx+LWiajpXlJZOcfZ/cDWXp1ejVVrK01lq1dH0tywmuoqyU9EAFSalPA1Q/DPUyXIS2CCE+FkIEXUJ6yOEiBJC7BFC7Dl79mwFd1uhqJ24kh9rgUxPWFGeQFPiy/kSvlBTqHxBx6VQniBzKdgKOaBkQFLBqGop9WZHCLEJCNd56cVyvE9XKWWCECIU2CiEOCil3FGO9ZFSLgOWAXTs2FFVwCkuS8ryhVfRL0VnUnbbYlg9GyAo+sIoSzrMmXS8tJSdszkjZ1Q0LersOPWOUXXCrX5KDVBSytudvSaEOCOEqC+lPC2EqA8kOdlGQvHPJCHE50AnYAdQpvUVCkX5MPv76/rZaWk5XdGBzfKuhAnaSKUirunO0OTdtstr6TW999FqsRztmVzhuJyz/TTj4MqhnMarnYqm+L4Cniz+/UngS8cFhBAGIYS/9jtwB/BnWddXKBTlx7EPFZRdXl7WVOGluqaX1wnCsfeSFlDynSxTGq7qmxzfV1GzVDRATQf6CCGOAH2K/0YI0UAI8XXxMmHATiHEXuBHYK2Ucp2r9RUK
ReVSHqVgWUUWl4ptESxcrInK0xGJOBUnlPIeztZzDI6K2k2FBDdSymSgt87zCcCdxb//DbQrz/oKhaJycWWvVF5sLZIuBdv1KiLJdlWv5Tj6EeibuFb0f6GoWpQXn0KhKBcV/Uovj8u3q7ks7cvLcc5KpeYuH5TVkUJxhVEWJ21XNVdlmb+5VMdxx/d1hd5cV2mpPcf3d3acevuvCnCrHzWCUiiuMMorZXfWU8mVis8ffZl5eV2+y5uC03Nrx+Y5x/dXsvHajRpBKRSKS8J2BGNbwJqPc+VdTQeEmn5/RflQAUqhUFwytq0tgkwmoPY043N8f89qfXdFZaAClEKhqBS0/kuONVgVsQUqzX+wLHNl2iMPRV1DBSiFQuGSsprUVgUp6em6Iy8L9v6EjkXJSsl3eaBEEgqFwiVlEUxU1/srrizUCEqhUCgUtRIVoBQKRZmpyXSf4spDBSiFQlFmSpsTqg3UFhWhouKoOSiFQlEuavucUG3fP0XZUSMohUKhUNRKVIBSKBQKRa1EBSiFQqFQ1EpUgFIoFApFrUQFKIVCoVDUSlSAUigUCkWtRAUohUKhUNRKRHnaL9cWhBBngeNV/DYhwLkqfo+a4HI8rsvxmEAdV13icjwmqL7jaiqlrOf4ZJ0MUNWBEGKPlLJjTe9HZXM5HtfleEygjqsucTkeE9T8cakUn0KhUChqJSpAKRQKhaJWogKUc5bV9A5UEZfjcV2OxwTquOoSl+MxQQ0fl5qDUigUCkWtRI2gFAqFQlErUQFKoVAoFLUSFaCKEUI8JITYJ4QoFEI4lVUKIfoJIQ4JIY4KISZV5z6WFyFEkBBioxDiSPFP3canQoh4IcQfQojfhBB7qns/y0pp/3tRxPzi138XQrSvif0sL2U4rh5CiLTiz+c3IcSUmtjP8iCEeFsIkSSE+NPJ63X1syrtuOriZ9VYCLFVCHGg+DswWmeZmvm8pJTqUTQP1xq4BtgGdHSyjDvwF9Ac8AL2AtfW9L67OKaZwKTi3ycBM5wsFw+E1PT+lnIspf7vgTuBbyhqoNoZ+KGm97uSjqsHsKam97Wcx3Ub0B7408nrde6zKuNx1cXPqj7Qvvh3f+Bwbbm21AiqGCnlASnloVIW6wQclVL+LaXMBf4L3FP1e3fJ3AOsLP59JXBvze1KhSnL//4e4D1ZxG4gUAhRv7p3tJzUtXOqTEgpdwApLhapi59VWY6rziGlPC2l/KX49wzgANDQYbEa+bxUgCofDYGTNn+fouQHWZsIk1KehqKTEAh1spwENgghfhZCRFXb3pWPsvzv69rnA2Xf5y5CiL1CiG+EEG2qZ9eqlLr4WZWVOvtZCSEigBuBHxxeqpHPy6Oq36A2IYTYBITrvPSilPLLsmxC57ka1em7OqZybKarlDJBCBEKbBRCHCy+U6xNlOV/X+s+nzJQln3+hSKvskwhxJ3AF0CLqt6xKqYuflZloc5+VkIII/ApMFZKme74ss4qVf55XVEBSkp5ewU3cQpobPN3IyChgtusEK6OSQhxRghRX0p5ung4nuRkGwnFP5OEEJ9TlHaqbQGqLP/7Wvf5lIFS99n2y0JK+bUQYrEQIkRKWZfNSeviZ1UqdfWzEkJ4UhSc/iOl/ExnkRr5vFSKr3z8BLQQQjQTQngBg4CvanifXPEV8GTx708CJUaJQgiDEMJf+x24A9BVKNUwZfnffwUMKVYcdQbStBRnLabU4xJChAshRPHvnSi6bpOrfU8rl7r4WZVKXfysivd3BXBAShnnZLEa+byuqBGUK4QQ9wELgHrAWiHEb1LKvkKIBsByKeWdUsp8IcQoYD1F6qu3pZT7anC3S2M68LEQIhI4ATwEYHtMQBjwefE15QF8IKVcV0P76xRn/3shxPDi15cAX1OkNjoKnAeeqqn9LStlPK4HgeeEEPlANjBIFkuraitCiA8pUrSFCCFOAS8DnlB3Pyso03HVuc8K6Ao8AfwhhPit+LnJQBOo2c9LWR0p
FAqFolaiUnwKhUKhqJWoAKVQKBSKWokKUAqFQqGolagApVAoFIpaiQpQCoVCoaiVqAClUCgUilqJClAKhUKhqJX8P9FX7Ja2oM6HAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "figure = plt.figure()\n",
+ "axis = figure.add_subplot(111)\n",
+ "axis.scatter(X[y == 0, 0], X[y == 0, 1], \n",
+ " edgecolor='black',\n",
+ " c='lightblue', marker='o', s=40, label='cluster 1')\n",
+ "\n",
+ "axis.scatter(X[y == 1, 0], X[y == 1, 1], \n",
+ " edgecolor='black',\n",
+ " c='red', marker='s', s=40, label='cluster 2')\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before we build a KNN classification model, we first have to convert our data to a cuDF representation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_df = cudf.DataFrame()\n",
+ "for column in range(X.shape[1]):\n",
+ " X_df['feature_' + str(column)] = np.ascontiguousarray(X[:, column])\n",
+ "\n",
+ "y_df = cudf.Series(y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we'll instantiate and fit a nearest neighbors model using the `NearestNeighbors` class from cuML."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from cuml.neighbors import NearestNeighbors\n",
+ "\n",
+ "\n",
+ "knn = NearestNeighbors()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "NearestNeighbors(n_neighbors=5, verbose=4, handle=, algorithm='brute', metric='euclidean', p=2, algo_params=None, metric_params=None, output_type='input')"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "knn.fit(X_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once our model has been built and fitted to the data, we can query the model for the `k` nearest neighbors of each data point. The query returns the distances from each data point to its `k` nearest neighbors, as well as the indices of those neighbors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "k = 3\n",
+ "\n",
+ "distances, indices = knn.kneighbors(X_df, n_neighbors=k)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can iterate through each of our data points and do a majority vote to determine which class it belongs to."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "predictions = []\n",
+ "cp_y = cp.asarray(y_df)\n",
+ "for i in range(indices.shape[0]):\n",
+ " row = indices.iloc[i, :].values\n",
+ " vote = sum(cp_y[j] for j in row) / k\n",
+ " predictions.append(1.0 * (vote > 0.5))\n",
+ "\n",
+ "predictions = np.asarray(predictions).astype(np.float32)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lastly, we can visualize the predictions from our K Nearest Neighbors classifier - we see that despite the non-linearity of the data, the algorithm does an excellent job of classifying the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))\n",
+ "\n",
+ "\n",
+ "ax1.scatter(X[y == 0, 0], X[y == 0, 1],\n",
+ " edgecolor='black',\n",
+ " c='lightblue', marker='o', s=40, label='cluster 1')\n",
+ "ax1.scatter(X[y == 1, 0], X[y == 1, 1],\n",
+ " edgecolor='black',\n",
+ " c='red', marker='s', s=40, label='cluster 2')\n",
+ "ax1.set_title('empirical data points')\n",
+ "\n",
+ "\n",
+ "ax2.scatter(X[predictions == 0, 0], X[predictions == 0, 1], c='lightblue',\n",
+ " edgecolor='black',\n",
+ " marker='o', s=40, label='cluster 1')\n",
+ "ax2.scatter(X[predictions == 1, 0], X[predictions == 1, 1], c='red',\n",
+ " edgecolor='black',\n",
+ " marker='s', s=40, label='cluster 2')\n",
+ "ax2.set_title('KNN predicted classes')\n",
+ "\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Conclusion\n",
+ "\n",
+ "In this notebook, we showed how to do GPU-accelerated supervised learning in RAPIDS. \n",
+ "\n",
+ "To learn more about RAPIDS, be sure to check out: \n",
+ "\n",
+ "* [Open Source Website](http://rapids.ai)\n",
+ "* [GitHub](https://github.com/rapidsai/)\n",
+ "* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
+ "* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
+ "* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
+ "* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/getting_started_notebooks/intro_tutorials/07_Introduction_to_XGBoost.ipynb b/getting_started_materials/intro_tutorials_and_guides/07_Introduction_to_XGBoost.ipynb
similarity index 100%
rename from getting_started_notebooks/intro_tutorials/07_Introduction_to_XGBoost.ipynb
rename to getting_started_materials/intro_tutorials_and_guides/07_Introduction_to_XGBoost.ipynb
diff --git a/getting_started_notebooks/intro_tutorials/08_Introduction_to_Dask_XGBoost.ipynb b/getting_started_materials/intro_tutorials_and_guides/08_Introduction_to_Dask_XGBoost.ipynb
similarity index 100%
rename from getting_started_notebooks/intro_tutorials/08_Introduction_to_Dask_XGBoost.ipynb
rename to getting_started_materials/intro_tutorials_and_guides/08_Introduction_to_Dask_XGBoost.ipynb
diff --git a/getting_started_notebooks/intro_tutorials/09_Introduction_to_Dimensionality_Reduction.ipynb b/getting_started_materials/intro_tutorials_and_guides/09_Introduction_to_Dimensionality_Reduction.ipynb
similarity index 100%
rename from getting_started_notebooks/intro_tutorials/09_Introduction_to_Dimensionality_Reduction.ipynb
rename to getting_started_materials/intro_tutorials_and_guides/09_Introduction_to_Dimensionality_Reduction.ipynb
diff --git a/getting_started_notebooks/intro_tutorials/10_Introduction_to_Clustering.ipynb b/getting_started_materials/intro_tutorials_and_guides/10_Introduction_to_Clustering.ipynb
similarity index 100%
rename from getting_started_notebooks/intro_tutorials/10_Introduction_to_Clustering.ipynb
rename to getting_started_materials/intro_tutorials_and_guides/10_Introduction_to_Clustering.ipynb
diff --git a/getting_started_materials/intro_tutorials_and_guides/11_Introduction_to_Strings.ipynb b/getting_started_materials/intro_tutorials_and_guides/11_Introduction_to_Strings.ipynb
new file mode 100644
index 00000000..f6ef887e
--- /dev/null
+++ b/getting_started_materials/intro_tutorials_and_guides/11_Introduction_to_Strings.ipynb
@@ -0,0 +1,3460 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Draft Intro into Strings \n",
+ "\n",
+ "**Authorship** \n",
+ "Original Author: Nicholas Davis \n",
+ "Last Edit: Nicholas Davis, 4/19/2021 \n",
+ "\n",
+ "**Test System Specs** \n",
+ "Test System Hardware: Tesla T4 \n",
+ "Test System Software: Ubuntu 18.04-py3.7 \n",
+ "RAPIDS Version: 0.18. - Docker Install \n",
+ "Driver: 450.80.02 \n",
+ "CUDA: 11.0 \n",
+ "\n",
+ "\n",
+ "**Known Working Systems** \n",
+ "RAPIDS Versions: 0.18"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Working with text data \n",
+ "\n",
+ "Enterprise analytics workflows commonly require processing large-scale text data. To address this need, the RAPIDS CUDA DataFrame library (cuDF) and RAPIDS CUDA Machine Learning library (cuML) now include string processing capabilities. cuDF has a fully-featured string and regular expression processing engine. With a pandas-like API, cuDF string analytics can provide data scientists with up to 90x performance improvement with minimal changes to their code. \n",
+ "\n",
+ "This notebook serves as an introduction to the string capabilities of cuDF. Each string operation is shown with a pandas example and its cuDF equivalent. \n",
+ "\n",
+ "For any additional information please reference: \n",
+ "[cuDF Documentation](https://docs.rapids.ai/api/cudf/stable/api.html#strings)
"
+ ],
+ "text/plain": [
+ " 0\n",
+ "0 1\n",
+ "1 2\n",
+ "2 "
+ ]
+ },
+ "execution_count": 58,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cudf.Series([\"a1\", \"b2\", \"c3\"], dtype=\"str\").str.extract(r\"[ab](\\d)\", expand=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "It returns a Series if expand=False."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 1\n",
+ "1 2\n",
+ "2 \n",
+ "dtype: string"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.Series([\"a1\", \"b2\", \"c3\"], dtype=\"string\").str.extract(r\"[ab](\\d)\", expand=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 1\n",
+ "1 2\n",
+ "2 \n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 60,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cudf.Series([\"a1\", \"b2\", \"c3\"], dtype=\"str\").str.extract(r\"[ab](\\d)\", expand=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "The next example uses a Series in which each subject string has exactly one match:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 a3\n",
+ "1 b3\n",
+ "2 c2\n",
+ "dtype: string\n"
+ ]
+ }
+ ],
+ "source": [
+ "pandasSeries = pd.Series([\"a3\", \"b3\", \"c2\"], dtype=\"string\")\n",
+ "print(pandasSeries)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 a3\n",
+ "1 b3\n",
+ "2 c2\n",
+ "dtype: object\n"
+ ]
+ }
+ ],
+ "source": [
+ "cudfSeries = cudf.Series([\"a3\", \"b3\", \"c2\"], dtype=\"str\")\n",
+ "print(cudfSeries)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Testing for strings that match or contain a pattern"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can check whether elements contain a pattern:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 True\n",
+ "5 True\n",
+ "dtype: bool"
+ ]
+ },
+ "execution_count": 63,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pattern = r\"[0-9][a-z]\"\n",
+ "\n",
+ "pd.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"str\",\n",
+ " ).str.contains(pattern)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 True\n",
+ "5 True\n",
+ "dtype: bool"
+ ]
+ },
+ "execution_count": 64,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pattern = r\"[0-9][a-z]\"\n",
+ "\n",
+ "cudf.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"str\",\n",
+ " ).str.contains(pattern)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Or whether elements match a pattern:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 False\n",
+ "5 True\n",
+ "dtype: boolean"
+ ]
+ },
+ "execution_count": 65,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"string\",\n",
+ " ).str.match(pattern)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 66,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 False\n",
+ "5 True\n",
+ "dtype: bool"
+ ]
+ },
+ "execution_count": 66,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cudf.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"str\",\n",
+ " ).str.match(pattern) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
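+ "pandas 1.1.0 added `fullmatch`, which tests whether the entire string matches the pattern. The cuDF cell after it uses `match` instead, so the two results differ for `4dx`."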
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 False\n",
+ "5 False\n",
+ "dtype: boolean"
+ ]
+ },
+ "execution_count": 67,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"string\",\n",
+ " ).str.fullmatch(pattern)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 68,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 True\n",
+ "3 True\n",
+ "4 False\n",
+ "5 True\n",
+ "dtype: bool"
+ ]
+ },
+ "execution_count": 68,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cudf.Series([\"1\", \"2\", \"3a\", \"3b\", \"03c\", \"4dx\"],dtype=\"str\",\n",
+ " ).str.match(pattern)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Methods like `match`, `fullmatch`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered True or False:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 69,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Strings that contain 'A':\n",
+ "0 True\n",
+ "1 False\n",
+ "2 False\n",
+ "3 True\n",
+ "4 False\n",
+ "5 False\n",
+ "6 True\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: boolean\n",
+ "\n",
+ "Strings that have swapped case:\n",
+ "0 a\n",
+ "1 b\n",
+ "2 c\n",
+ "3 aABA\n",
+ "4 bACA\n",
+ "5 \n",
+ "6 caba\n",
+ "7 DOG\n",
+ "8 CAT\n",
+ "dtype: string\n",
+ "\n",
+ "Strings that start with 'b':\n",
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 False\n",
+ "4 False\n",
+ "5 \n",
+ "6 False\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: boolean\n",
+ "\n",
+ "Strings that ends with 'a':\n",
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 True\n",
+ "4 True\n",
+ "5 \n",
+ "6 False\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: boolean\n"
+ ]
+ }
+ ],
+ "source": [
+ "pandasSeries5 = pd.Series([\"A\", \"B\", \"C\", \"Aaba\", \"Baca\", np.nan, \"CABA\", \"dog\", \"cat\"], dtype=\"string\") \n",
+ "print(\"Strings that contain 'A':\")\n",
+ "print(pandasSeries5.str.contains(\"A\", na=False))\n",
+ "print(\"\\nStrings that have swapped case:\")\n",
+ "print(pandasSeries5.str.swapcase())\n",
+ "print(\"\\nStrings that start with 'b':\")\n",
+ "print(pandasSeries5.str.startswith('b'))\n",
+ "print(\"\\nStrings that ends with 'a':\")\n",
+ "print(pandasSeries5.str.endswith('a'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Strings that contain 'A':\n",
+ "0 True\n",
+ "1 False\n",
+ "2 False\n",
+ "3 True\n",
+ "4 False\n",
+ "5 \n",
+ "6 True\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: bool\n",
+ "\n",
+ "Strings that have swapped case:\n",
+ "0 a\n",
+ "1 b\n",
+ "2 c\n",
+ "3 aABA\n",
+ "4 bACA\n",
+ "5 \n",
+ "6 caba\n",
+ "7 DOG\n",
+ "8 CAT\n",
+ "dtype: object\n",
+ "\n",
+ "Strings that start with 'b':\n",
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 False\n",
+ "4 False\n",
+ "5 \n",
+ "6 False\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: bool\n",
+ "\n",
+ "Strings that ends with 'a':\n",
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 True\n",
+ "4 True\n",
+ "5 \n",
+ "6 False\n",
+ "7 False\n",
+ "8 False\n",
+ "dtype: bool\n"
+ ]
+ }
+ ],
+ "source": [
+ "cudfSeries5 = cudf.Series([\"A\", \"B\", \"C\", \"Aaba\", \"Baca\", np.nan, \"CABA\", \"dog\", \"cat\"], dtype=\"str\") \n",
+ "print(\"Strings that contain 'A':\")\n",
+ "print(cudfSeries5.str.contains(\"A\"))\n",
+ "print(\"\\nStrings that have swapped case:\")\n",
+ "print(cudfSeries5.str.swapcase())\n",
+ "print(\"\\nStrings that start with 'b':\")\n",
+ "print(cudfSeries5.str.startswith('b'))\n",
+ "print(\"\\nStrings that ends with 'a':\")\n",
+ "print(cudfSeries5.str.endswith('a'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/getting_started_notebooks/intro_tutorials/README.md b/getting_started_materials/intro_tutorials_and_guides/README.md
similarity index 100%
rename from getting_started_notebooks/intro_tutorials/README.md
rename to getting_started_materials/intro_tutorials_and_guides/README.md
diff --git a/getting_started_notebooks/basics/Getting_Started_with_Dask.ipynb b/getting_started_notebooks/basics/Getting_Started_with_Dask.ipynb
deleted file mode 100644
index ea219b2c..00000000
--- a/getting_started_notebooks/basics/Getting_Started_with_Dask.ipynb
+++ /dev/null
@@ -1,415 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Introduction to Dask\n",
- "#### By Paul Hendricks\n",
- "-------\n",
- "\n",
- "In this notebook, we will show how to get started with Dask using basic Python primitives like integers and strings.\n",
- "\n",
- "**Table of Contents**\n",
- "\n",
- "* [Introduction to Dask](#introduction)\n",
- "* [Setup](#setup)\n",
- "* [Introduction to Dask](#dask)\n",
- "* [Conclusion](#conclusion)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Setup\n",
- "\n",
- "This notebook was tested using the following Docker containers:\n",
- "\n",
- "* `rapidsai/rapidsai-dev-nightly:0.10-cuda10.0-devel-ubuntu18.04-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)\n",
- "\n",
- "This notebook was run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the below examples. \n",
- "\n",
- "If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues\n",
- "\n",
- "Before we begin, let's check out our hardware setup by running the `nvidia-smi` command."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvidia-smi"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's see what CUDA version we have:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvcc --version"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!apt update\n",
- "!apt install -y graphviz\n",
- "!conda install graphviz\n",
- "!conda install python-graphviz"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Introduction to Dask\n",
- "\n",
- "Dask is a library that allows for parallelized computing. Written in Python, it allows one to compose complex workflows using large data structures like those found in NumPy, Pandas, and cuDF. In the following examples and notebooks, we'll show how to use Dask with cuDF to accelerate common ETL tasks as well as build and train machine learning models like Linear Regression and XGBoost.\n",
- "\n",
- "To learn more about Dask, check out the documentation here: http://docs.dask.org/en/latest/\n",
- "\n",
- "#### Client/Workers\n",
- "\n",
- "Dask operates by creating a cluster composed of a \"client\" and multiple \"workers\". The client is responsible for scheduling work; the workers are responsible for actually executing that work. \n",
- "\n",
- "Typically, we set the number of workers to be equal to the number of computing resources we have available to us. For CPU based workflows, this might be the number of cores or threads on that particular machine. For example, we might set `n_workers = 8` if we have 8 CPU cores or threads on our machine that can each operate in parallel. This allows us to take advantage of all of our computing resources and enjoy the most benefits from parallelization.\n",
- "\n",
- "On a system with one or more GPUs, we usually set the number of workers equal to the number of GPUs available to us. Dask is a first class citizen in the world of General Purpose GPU computing and the RAPIDS ecosystem makes it very easy to use Dask with cuDF and XGBoost. \n",
- "\n",
- "Before we get started with Dask, we need to set up a Local Cluster of workers to execute our work and a Client to coordinate and schedule work for that cluster. As we see below, we can initiate a `cluster` and `client` using only a few lines of code."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import dask; print('Dask Version:', dask.__version__)\n",
- "from dask.distributed import Client, LocalCluster\n",
- "import subprocess\n",
- "\n",
- "# parse the hostname IP address\n",
- "cmd = \"hostname --all-ip-addresses\"\n",
- "process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n",
- "output, error = process.communicate()\n",
- "ip_address = str(output.decode()).split()[0]\n",
- "\n",
- "# create a local cluster with 4 workers\n",
- "n_workers = 4\n",
- "cluster = LocalCluster(ip=ip_address, n_workers=n_workers)\n",
- "client = Client(cluster)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's inspect the `client` object to view our current Dask status. We should see the IP Address for our Scheduler as well as the number of workers in our Cluster. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# show current Dask status\n",
- "client"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "You can also see the status and more information at the Dashboard, found at `http:///status`. You can ignore this for now; we'll dive into this in subsequent tutorials.\n",
- "\n",
- "With our client and workers set up, it's time to execute our first program in parallel. We'll define a function that takes some value `x` and adds 5 to it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def add_5_to_x(x):\n",
- " return x + 5"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we'll iterate through our `n_workers` and create an execution graph, where each worker is responsible for taking its ID and passing it to the function `add_5_to_x`. For example, the worker with ID 2 will take its ID and add 5, resulting in the value 7."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from dask import delayed\n",
- "\n",
- "addition_operations = [delayed(add_5_to_x)(i) for i in range(n_workers)]\n",
- "addition_operations"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The above output shows a list of several `Delayed` objects. An important thing to note is that the workers aren't actually executing these results - we're just defining the execution graph for our client to execute later. The `delayed` function wraps our function `add_5_to_x` and returns a `Delayed` object. This ensures that this computation is in fact \"delayed\" - or lazily evaluated - and not executed on the spot i.e. when we define it.\n",
- "\n",
- "Next, let's sum each one of these intermediate results. We can accomplish this by wrapping Python's built-in `sum` function using our `delayed` function and storing this in a variable called `total`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "total = delayed(sum)(addition_operations)\n",
- "total"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using the `graphviz` library, we can use the `visualize` method of a `Delayed` object to visualize our current graph."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "total.visualize()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As we mentioned before, none of these results - intermediate or final - have actually been computed. We can compute them using the `compute` method of our `client`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import time\n",
- "\n",
- "addition_futures = client.compute(addition_operations, optimize_graph=False, fifo_timeout=\"0ms\")\n",
- "total_future = client.compute(total, optimize_graph=False, fifo_timeout=\"0ms\")\n",
- "time.sleep(1) # this will give Dask time to execute each worker"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's inspect the output of each call to `client.compute`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "addition_futures"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can see from the above output that our `addition_futures` variable is a list of `Future` objects - not the \"actual results\" of adding 5 to each of `[0, 1, 2, 3]`. These `Future` objects are a promise that at one point a computation will take place and we will be left with a result. Dask is responsible for ensuring that promise by delegating that task to the appropriate Dask worker and collecting the result.\n",
- "\n",
- "Let's take a look at our `total_future` object:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(total_future)\n",
- "print(type(total_future))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Again, we see that this is an object of type `Future` as well as metadata about the status of the request (i.e. whether it has finished or not), the type of the result, and a key associated with that operation. To collect and print the result of each of these `Future` objects, we can call the `result()` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "addition_results = [future.result() for future in addition_futures]\n",
- "print('Addition Results:', addition_results)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we see the results that we want from our addition operations. We can also use the simpler syntax of the `client.gather` method to collect our results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "addition_results = client.gather(addition_futures)\n",
- "total_result = client.gather(total_future)\n",
- "print('Addition Results:', addition_results)\n",
- "print('Total Result:', total_result)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Awesome! We just wrote our first distributed workflow.\n",
- "\n",
- "To confirm that Dask is truly executing in parallel, let's define a function that sleeps for 1 second and returns the string \"Success!\". Called serially, once per worker, this function should take around 4 seconds in total for our 4 workers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def sleep_1():\n",
- " time.sleep(1)\n",
- " return 'Success!'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "for _ in range(n_workers):\n",
- " sleep_1()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As expected, our process takes about 4 seconds to run. Now let's execute this same workflow in parallel using Dask."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "# define delayed execution graph\n",
- "sleep_operations = [delayed(sleep_1)() for _ in range(n_workers)]\n",
- "\n",
- "# use client to perform computations using execution graph\n",
- "sleep_futures = client.compute(sleep_operations, optimize_graph=False, fifo_timeout=\"0ms\")\n",
- "\n",
- "# collect and print results\n",
- "sleep_results = client.gather(sleep_futures)\n",
- "print(sleep_results)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using Dask, we see that this whole process takes a little over a second - each worker is executing in parallel!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Conclusion\n",
- "\n",
- "In this tutorial, we learned how to use Dask with basic Python primitives like integers and strings.\n",
- "\n",
- "To learn more about RAPIDS, be sure to check out: \n",
- "\n",
- "* [Open Source Website](http://rapids.ai)\n",
- "* [GitHub](https://github.com/rapidsai/)\n",
- "* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
- "* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
- "* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
- "* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
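The submit-then-gather pattern that the removed Dask notebook walks through can be sketched with the standard library alone. This is a rough analogue rather than Dask itself: it assumes `concurrent.futures.ThreadPoolExecutor` in place of a Dask `Client`, and `executor.submit` starts work eagerly, whereas `delayed` only builds a graph until `client.compute` is called. The function name `add_5_to_x` and the value of `n_workers` are taken from the notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def add_5_to_x(x):
    return x + 5


n_workers = 4

with ThreadPoolExecutor(max_workers=n_workers) as executor:
    # submit() returns a Future immediately, loosely like client.compute
    addition_futures = [executor.submit(add_5_to_x, i) for i in range(n_workers)]
    # result() blocks until each value is ready, like Future.result() in Dask
    addition_results = [future.result() for future in addition_futures]

total_result = sum(addition_results)
print('Addition Results:', addition_results)
print('Total Result:', total_result)
```

The futures here live in one local process; Dask's `Future` objects point at results held by workers that may sit on other GPUs or machines.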
diff --git a/getting_started_notebooks/basics/Getting_Started_with_cuDF.ipynb b/getting_started_notebooks/basics/Getting_Started_with_cuDF.ipynb
deleted file mode 100644
index 5b307586..00000000
--- a/getting_started_notebooks/basics/Getting_Started_with_cuDF.ipynb
+++ /dev/null
@@ -1,537 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Getting Started with cuDF\n",
- "#### By Yi Dong, Paul Hendricks\n",
- "-------\n",
- "\n",
- "While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal. \n",
- "\n",
- "NVIDIA created RAPIDS – an open-source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations. RAPIDS is based on Python, has pandas-like and Scikit-Learn-like interfaces, is built on Apache Arrow in-memory data format, and can scale from 1 to multi-GPU to multi-nodes. RAPIDS integrates easily into the world’s most popular data science Python-based workflows. RAPIDS accelerates data science end-to-end – from data prep, to machine learning, to deep learning. And through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.\n",
- "\n",
- "In this notebook, we will also show how to get started with GPU DataFrames using cuDF in RAPIDS.\n",
- "\n",
- "**Table of Contents**\n",
- "\n",
- "* Setup\n",
- "* Loading data into a GPU DataFrame (GDF)\n",
- " * Loading data into a Pandas DataFrame\n",
- " * Converting a Pandas DataFrame to a GDF\n",
- "* Working with the GDF\n",
- " * Take a look at the columns and their data types\n",
- " * Slice the GDF\n",
- " * Modify data types\n",
- " * Manipulate data with a user-defined function (UDF)\n",
- " * Sort the data\n",
- " * Filter the data\n",
- " * One-hot encode categorical columns\n",
- " * Split the data into training and validation sets\n",
- " * Turn the GDFs into matrices\n",
- "* Conclusion"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "This notebook was tested using the `rapidsai/rapidsai-dev-nightly:0.10-cuda10.0-devel-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly) and run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the below examples. \n",
- "\n",
- "If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues\n",
- "\n",
- "Before we begin, let's check out our hardware setup by running the `nvidia-smi` command."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvidia-smi"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's see what CUDA version we have:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvcc --version"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Last, let's ensure that we have graphviz installed"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "done\n",
- "\n",
- "\n",
- "==> WARNING: A newer version of conda exists. <==\n",
- " current version: 4.6.14\n",
- " latest version: 4.7.12\n",
- "\n",
- "Please update conda by running\n",
- "\n",
- " $ conda update -n base -c defaults conda\n",
- "\n",
- "\n",
- "\n",
- "## Package Plan ##\n",
- "\n",
- " environment location: /opt/conda/envs/rapids\n",
- "\n",
- " added / updated specs:\n",
- " - python-graphviz\n",
- "\n",
- "\n",
- "The following packages will be downloaded:\n",
- "\n",
- " package | build\n",
- " ---------------------------|-----------------\n",
- " python-graphviz-0.10.1 | py_0 22 KB\n",
- " ------------------------------------------------------------\n",
- " Total: 22 KB\n",
- "\n",
- "The following packages will be SUPERSEDED by a higher-priority channel:\n",
- "\n",
- " python-graphviz conda-forge::python-graphviz-0.13-py_0 --> pkgs/main::python-graphviz-0.10.1-py_0\n",
- "\n",
- "\n",
- "\n",
- "Downloading and Extracting Packages\n",
- "python-graphviz-0.10 | 22 KB | ##################################### | 100% \n",
- "Preparing transaction: done\n",
- "Verifying transaction: done\n",
- "Executing transaction: done\n"
- ]
- }
- ],
- "source": [
- "!apt update\n",
- "!apt install -y graphviz\n",
- "!conda install -y graphviz\n",
- "!conda install -y python-graphviz"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Loading data into a GPU DataFrame (GDF)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Loading data into a Pandas DataFrame\n",
- "\n",
- "It's easy to load almost any sort of data (json, csv, etc) into a Pandas DataFrame. \n",
- "For example, let's import some census data from a compressed CSV file on disk:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd; print('pandas Version:', pd.__version__)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# read data from csv file into pandas dataframe\n",
- "df = pd.read_csv('../../data/ipums/ipums_easy.csv')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "[Read more on using a Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Converting a Pandas DataFrame to a GDF\n",
- "\n",
- "Next, we use our `pandas.DataFrame` to create a `cudf.dataframe.DataFrame` object using the `from_pandas` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cudf\n",
- "\n",
- "# convert the pandas DataFrame into a GPU DataFrame\n",
- "gdf = cudf.DataFrame.from_pandas(df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And that's it! For the most part, working with GPU DataFrames will be the same as working with Pandas DataFrames. See the [cuDF documentation](https://cudf.readthedocs.io/en/latest/index.html) for more information.\n",
- "\n",
- "## Working with the GDF"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Take a look at the columns and their data types"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# print the columns and their datatypes in this gdf\n",
- "gdf.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Slice the GDF\n",
- "\n",
- "Whoa! This GDF has a lot of columns. Let's make it more manageable..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# only select certain columns (and overwrite the gdf)\n",
- "column_names = [\n",
- " 'INCEARN', 'PERWT', 'ADJUST', 'STATEICP', 'ROOMS', 'BEDROOMS',\n",
- " 'PHONE', 'VEHICLES', 'RACE', 'SEX', 'AGE', 'VETSTAT'\n",
- "]\n",
- "gdf = gdf.loc[:, column_names]\n",
- "\n",
- "# show the first 5 records of each column\n",
- "print(gdf.head(5))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Modify data types"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "gdf.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Looks like `INCEARN` and `PERWT` are integers when they should be floats. Let's fix that..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# convert the following two int64 columns to float64 data type\n",
- "gdf['INCEARN'] = gdf['INCEARN'].astype(np.float64)\n",
- "gdf['PERWT'] = gdf['PERWT'].astype(np.float64)\n",
- "\n",
- "# take another look\n",
- "gdf.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Manipulate data with a user-defined function (UDF)\n",
- "\n",
- "`INCEARN` is a column in our dataset that supposedly represents income earned; however, it does not truly represent the amount of income earned when adjusted for inflation. The `ADJUST` column represents the dollar inflation factor, which we can use to adjust `INCEARN` to the amount that the individual would have earned during the calendar year. In our dataset, `ADJUST` is constant over all rows.\n",
- "\n",
- "Below, we will define a simple function `adjust_incearn` that takes `INCEARN` and multiplies it by a constant - in this case, the dollar inflation factor. We'll use the `applymap` method in our `cudf.dataframe.DataFrame` object to apply an element-wise function to transform the values in the Column."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# define a function to adjust the incearn column\n",
- "# so it more accurately represents income earned\n",
- "adjust = gdf['ADJUST'][0] # take constant from first row\n",
- "print('adjustment factor: {}'.format(adjust))\n",
- "def adjust_incearn(incearn):\n",
- " return adjust * incearn;\n",
- "\n",
- "# apply it to the 'INCEARN' column\n",
- "gdf['INCEARN'] = gdf['INCEARN'].applymap(adjust_incearn)\n",
- "\n",
- "# drop the ADJUST column\n",
- "gdf.drop_column('ADJUST')\n",
- "\n",
- "# compute the mean\n",
- "print('mean adjusted income: {}'.format(gdf['INCEARN'].mean()))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Sort the data\n",
- "\n",
- "Next, let's sort our data to do some light exploration."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# sort the gdf by the INCEARN column\n",
- "gdf = gdf.sort_values(by='INCEARN', ascending=True)\n",
- "# reset the index so we can use loc slicing later\n",
- "gdf = gdf.reset_index()\n",
- "print(gdf.head(5))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Looks like we have some negative income values. Let's filter those out...\n",
- "\n",
- "### Filter the data\n",
- "\n",
- "We'll use the `query` method to filter our dataset. The `query` method takes a boolean expression as its argument, much like the `query` method of the `pandas.DataFrame` class. However, the `cudf.dataframe.DataFrame` implementation uses Numba to compile a GPU kernel. \n",
- "\n",
- "For more information on the syntax for arguments into `query`, see the Pandas documentation: \n",
- "\n",
- "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# how many records do we have?\n",
- "print(\"{} = Original # of records\".format(len(gdf)))\n",
- "\n",
- "# filter out\n",
- "gdf = gdf.query('INCEARN >= 0')\n",
- "\n",
- "# how many records do we have left?\n",
- "print(\"{} = New # of records\".format(len(gdf)))\n",
- "\n",
- "# sanity check...\n",
- "print(gdf.head(5))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### One-hot encode categorical columns\n",
- "\n",
- "Next, let's prepare our categorical columns. Machine learning models won't take strings as inputs, so we need to go to each column and convert its string representations to a numerical representation. The most common way to convert a Column with `n` elements and `k` unique categories to a numerical representation is to create a matrix of shape `n` by `k` and place a 1 in cell `(i, j)` if the `i`th element is of category `j` and 0 otherwise, where $j \\in \\{1, \\dots, k\\}$. This is known as one-hot encoding."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# define the categorical columns\n",
- "cat_cols = set(['STATEICP', 'RACE', 'SEX', 'VETSTAT'])\n",
- "# store the unique values for each category column\n",
- "uniques = {}\n",
- "\n",
- "# iterate through each categorical column and one-hot\n",
- "# encode it using the unique values it has\n",
- "for k in cat_cols:\n",
- " uniques[k] = gdf[k].unique_k(k=1000)\n",
- " cats = uniques[k][1:] # drop first\n",
- " gdf = gdf.one_hot_encoding(k, prefix=k, cats=cats)\n",
- " del gdf[k]\n",
- " \n",
- "# we should see many more columns since the categorical\n",
- "# columns will get expanded due to one-hot encoding\n",
- "gdf.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Split the data into training and validation sets\n",
- "\n",
- "Next, let's split our data into an 80% train dataset and a 20% validation dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# enforce float64 data type on ALL columns\n",
- "for k in gdf.columns:\n",
- " gdf[k] = gdf[k].astype(np.float64)\n",
- "\n",
- "# set the fractions for training and validation\n",
- "fractions = {\n",
- " \"train\": 0.8,\n",
- " \"valid\": 0.2\n",
- "}\n",
- "\n",
- "# validation splitpoint\n",
- "splitpoint = int(len(gdf) * fractions[\"train\"])\n",
- "print('splitpoint: {} of {} is {}'.format(fractions[\"train\"], len(gdf), splitpoint))\n",
- "\n",
- "# break the gdf up into training and validation sets\n",
- "gdfs = {\n",
- " \"train\": gdf.iloc[:splitpoint],\n",
- " \"valid\": gdf.iloc[splitpoint:]\n",
- "}\n",
- "print('gdfs[\"train\"] has {} rows'.format(len(gdfs[\"train\"])))\n",
- "print('gdfs[\"valid\"] has {} rows'.format(len(gdfs[\"valid\"])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Turn the GDFs into matrices\n",
- "\n",
- "Lastly, we want to convert our GPU DataFrame to a GPU Matrix for usage as input to other machine learning libraries such as cuML and XGBoost. We can use the `as_gpu_matrix` method to facilitate this conversion."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# produce gpu matrices (to input to ML libraries, etc.)\n",
- "# this step should not be necessary in the near future\n",
- "# (should be able to use gdf as input)\n",
- "matrices = {\n",
- " \"train\": {\n",
- " \"x\": gdfs[\"train\"].as_gpu_matrix(columns=gdf.columns[1:]),\n",
- " \"y\": gdfs[\"train\"].as_gpu_matrix(columns=[gdf.columns[0]])\n",
- " },\n",
- " \"valid\": {\n",
- " \"x\": gdfs[\"valid\"].as_gpu_matrix(columns=gdf.columns[1:]),\n",
- " \"y\": gdfs[\"valid\"].as_gpu_matrix(columns=[gdf.columns[0]])\n",
- " }\n",
- "}\n",
- "\n",
- "# check the matrix shapes (sanity check)\n",
- "print('matrices[\"train\"][\"x\"] shape:', matrices[\"train\"][\"x\"].shape)\n",
- "print('matrices[\"train\"][\"y\"] shape:', matrices[\"train\"][\"y\"].shape)\n",
- "print('matrices[\"valid\"][\"x\"] shape:', matrices[\"valid\"][\"x\"].shape)\n",
- "print('matrices[\"valid\"][\"y\"] shape:', matrices[\"valid\"][\"y\"].shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Conclusion\n",
- "\n",
- "To learn more about RAPIDS, be sure to check out: \n",
- "\n",
- "* [Open Source Website](http://rapids.ai)\n",
- "* [GitHub](https://github.com/rapidsai/)\n",
- "* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
- "* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
- "* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
- "* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
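The last two preparation steps in the removed cuDF notebook, one-hot encoding and the 80/20 split, can be sketched in plain Python without a GPU. This is only an illustration under stated assumptions: `one_hot_encode` is a hypothetical helper, the toy `sex` column and `records` list are made up, and sorting the categories is an assumption (the notebook takes its category order from `unique_k`).

```python
def one_hot_encode(values, drop_first=True):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    categories = sorted(set(values))  # sorted order is an assumption
    if drop_first:
        # like the notebook's `cats = uniques[k][1:]`, drop the first category
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# hypothetical toy column standing in for a categorical column such as SEX
sex = [1, 2, 2, 1, 2]
categories, encoded = one_hot_encode(sex)
print(categories)
print(encoded)

# positional 80/20 split of ten hypothetical records
records = list(range(10))
splitpoint = int(len(records) * 0.8)
train, valid = records[:splitpoint], records[splitpoint:]
print(len(train), len(valid))
```

Slicing by position keeps the two sets disjoint. Label-based `.loc` slicing in pandas includes both endpoints, so the row at the split label would land in both sets.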
diff --git a/getting_started_notebooks/intro_tutorials/01_Introduction_to_RAPIDS.ipynb b/getting_started_notebooks/intro_tutorials/01_Introduction_to_RAPIDS.ipynb
deleted file mode 100644
index b0124b36..00000000
--- a/getting_started_notebooks/intro_tutorials/01_Introduction_to_RAPIDS.ipynb
+++ /dev/null
@@ -1,708 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Introduction to RAPIDS\n",
- "#### By Paul Hendricks\n",
- "-------\n",
- "\n",
- "While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal. \n",
- "\n",
- "NVIDIA created RAPIDS – an open-source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations. RAPIDS is based on Python, has Pandas-like and Scikit-Learn-like interfaces, is built on Apache Arrow in-memory data format, and can scale from 1 to multi-GPU to multi-nodes. RAPIDS integrates easily into the world’s most popular data science Python-based workflows. RAPIDS accelerates data science end-to-end – from data prep, to machine learning, to deep learning. And through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.\n",
- "\n",
- "In this notebook, we will discuss and show at a high level what each of the packages in the RAPIDS are as well as what they do. Subsequent notebooks will dive deeper into the various areas of data science and machine learning and show how you can use RAPIDS to accelerate your workflow in each of these areas.\n",
- "\n",
- "**Table of Contents**\n",
- "\n",
- "* [Introduction to RAPIDS](#introduction)\n",
- "* [Setup](#setup)\n",
- "* [Pandas](#pandas)\n",
- "* [cuDF](#cudf)\n",
- "* [Scikit-Learn](#scikitlearn)\n",
- "* [cuML](#cuml)\n",
- "* [Dask](#dask)\n",
- "* [Dask cuDF](#daskcudf)\n",
- "* [Conclusion](#conclusion)\n",
- "\n",
- "Before going any further, let's make sure we have access to `matplotlib`, a popular Python library for visualizing data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "\n",
- "try:\n",
- " import matplotlib\n",
- "except ModuleNotFoundError:\n",
- " os.system('conda install -y matplotlib')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Setup\n",
- "\n",
- "This notebook was tested using the following Docker containers:\n",
- "\n",
- "* `rapidsai/rapidsai-dev-nightly:0.10-cuda10.0-devel-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)\n",
- "\n",
- "This notebook was run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the below examples. \n",
- "\n",
- "If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues\n",
- "\n",
- "Before we begin, let's check out our hardware setup by running the `nvidia-smi` command."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvidia-smi"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's see what CUDA version we have:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvcc --version"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's load some helper functions from `matplotlib` and configure the Jupyter Notebook for visualization."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from matplotlib.colors import ListedColormap\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "\n",
- "%matplotlib inline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Pandas\n",
- "\n",
- "Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data - as the name suggests - comes in a structured form, often represented by a table or CSV. We'll focus the majority of these tutorials on working with these types of data.\n",
- "\n",
- "There exist many tools in the Python ecosystem for working with structured, tabular data but few are as widely used as Pandas. Pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing and many more. \n",
- "\n",
- "For more information on Pandas, check out the excellent documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
- "\n",
- "Below we show how to create a Pandas DataFrame, an internal object for representing tabular data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd; print('Pandas Version:', pd.__version__)\n",
- "\n",
- "\n",
- "# here we create a Pandas DataFrame with\n",
- "# two columns named \"key\" and \"value\"\n",
- "df = pd.DataFrame()\n",
- "df['key'] = [0, 0, 2, 2, 3]\n",
- "df['value'] = [float(i + 10) for i in range(5)]\n",
- "print(df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can perform many operations on this data. For example, let's say we wanted to sum all values in the `value` column. We could accomplish this using the following syntax:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "aggregation = df['value'].sum()\n",
- "print(aggregation)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## cuDF\n",
- "\n",
- "Pandas is fantastic for working with small datasets that fit into your system's memory. However, datasets are growing larger and data scientists are working with increasingly complex workloads - the need for accelerated compute arises.\n",
- "\n",
- "cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing Pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.\n",
- "\n",
- "Below, we show how to create a cuDF DataFrame."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cudf; print('cuDF Version:', cudf.__version__)\n",
- "\n",
- "\n",
- "# here we create a cuDF DataFrame with\n",
- "# two columns named \"key\" and \"value\"\n",
- "df = cudf.DataFrame()\n",
- "df['key'] = [0, 0, 2, 2, 3]\n",
- "df['value'] = [float(i + 10) for i in range(5)]\n",
- "print(df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As before, we can take this cuDF DataFrame and perform a `sum` operation over the `value` column. The key difference is that any operations we perform using cuDF use the GPU instead of the CPU."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "aggregation = df['value'].sum()\n",
- "print(aggregation)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note how the syntax for both creating and manipulating a cuDF DataFrame is identical to the syntax necessary to create and manipulate Pandas DataFrames; the cuDF API is based on the Pandas API. This design choice minimizes the cognitive burden of switching from a CPU based workflow to a GPU based workflow and allows data scientists to focus on solving problems while benefitting from the speed of a GPU!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Scikit-Learn\n",
- "\n",
- "After our data has been preprocessed, we often want to build a model so as to understand the relationships between different variables in our data. Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data. Below we show a simple example of how to create a Linear Regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np; print('NumPy Version:', np.__version__)\n",
- "\n",
- "\n",
- "# create the relationship: y = 2.0 * x + 1.0\n",
- "n_rows = 100000 # let's use 100 thousand data points\n",
- "w = 2.0\n",
- "x = np.random.normal(loc=0, scale=1, size=(n_rows,))\n",
- "b = 1.0\n",
- "y = w * x + b\n",
- "\n",
- "# add a bit of noise\n",
- "noise = np.random.normal(loc=0, scale=2, size=(n_rows,))\n",
- "y_noisy = y + noise"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now visualize our data using the `matplotlib` library."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We'll use the `LinearRegression` class from Scikit-Learn to instantiate a model and fit it to our data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n",
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
- "\n",
- "# instantiate and fit model\n",
- "linear_regression = LinearRegression()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "linear_regression.fit(np.expand_dims(x, 1), y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create new data and perform inference\n",
- "inputs = np.linspace(start=-5, stop=5, num=1000)\n",
- "outputs = linear_regression.predict(np.expand_dims(inputs, 1))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's now visualize our empirical data points, the true relationship of the data, and the relationship estimated by the model. Looks pretty close!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## cuML\n",
- "\n",
- "The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. cuML makes it easy to build machine learning models in an accelerated fashion while still using an interface nearly identical to Scikit-Learn. The example below shows how to build the same Linear Regression model on a GPU.\n",
- "\n",
- "First, let's convert our data from a NumPy representation to a cuDF representation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create a cuDF DataFrame\n",
- "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n",
- "print(df.head())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cuml; print('cuML Version:', cuml.__version__)\n",
- "from cuml.linear_model import LinearRegression as LinearRegression_GPU\n",
- "\n",
- "\n",
- "# instantiate and fit model\n",
- "linear_regression_gpu = LinearRegression_GPU()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "linear_regression_gpu.fit(df[['x']], df['y'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create new data and perform inference\n",
- "new_data_df = cudf.DataFrame({'inputs': inputs})\n",
- "outputs_gpu = linear_regression_gpu.predict(new_data_df[['inputs']])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Lastly, we can overlay the relationship predicted by our GPU accelerated Linear Regression model (green line) on our empirical data points (light blue circles), the true relationship (black line), and the relationship predicted by the model built on the CPU (red line). We see that our GPU accelerated model's estimate of the true relationship (green line) is identical to the CPU based model's estimate (red line)!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
- "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='predicted relationship (gpu)')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Dask\n",
- "\n",
- "Dask is a library that facilitates distributed computing. Written in Python, it allows one to compose complex workflows using basic Python primitives like integers or strings as well as large data structures like those found in NumPy, Pandas, and cuDF. In the following examples and notebooks, we'll show how to use Dask with cuDF to accelerate common ETL tasks and train machine learning models like Linear Regression and XGBoost.\n",
- "\n",
- "To learn more about Dask, check out the documentation here: http://docs.dask.org/en/latest/\n",
- "\n",
- "#### Client/Workers\n",
- "\n",
- "Dask operates by creating a cluster composed of a \"client\" and multiple \"workers\". The client is responsible for scheduling work; the workers are responsible for actually executing that work. \n",
- "\n",
- "Typically, we set the number of workers to be equal to the number of computing resources we have available to us. For CPU based workflows, this might be the number of cores or threads on that particular machine. For example, we might set `n_workers = 8` if we have 8 CPU cores or threads on our machine that can each operate in parallel. This allows us to take advantage of all of our computing resources and enjoy the most benefits from parallelization.\n",
- "\n",
- "To get started, we'll create a local cluster of workers and a client to interact with that cluster."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import dask; print('Dask Version:', dask.__version__)\n",
- "from dask.distributed import Client, LocalCluster\n",
- "\n",
- "\n",
- "# create a local cluster with 4 workers\n",
- "n_workers = 4\n",
- "cluster = LocalCluster(n_workers=n_workers)\n",
- "client = Client(cluster)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's inspect the `client` object to view our current Dask status. We should see the IP address for our Scheduler as well as the number of workers in our Cluster."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# show current Dask status\n",
- "client"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "You can also see the status and more information at the Dashboard, found at `http:///status`. You can ignore this for now; we'll dive into it in subsequent tutorials.\n",
- "\n",
- "With our client and cluster of workers set up, it's time to execute our first distributed program. We'll define a function called `sleep_1` that sleeps for 1 second and returns the string \"Success!\". Executed serially four times, this function should take around 4 seconds to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import time\n",
- "\n",
- "\n",
- "def sleep_1():\n",
- " time.sleep(1)\n",
- " return 'Success!'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "for _ in range(n_workers):\n",
- " sleep_1()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As expected, our workflow takes about 4 seconds to run. Now let's execute this same workflow in distributed fashion using Dask."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from dask.delayed import delayed"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "\n",
- "# define delayed execution graph\n",
- "sleep_operations = [delayed(sleep_1)() for _ in range(n_workers)]\n",
- "\n",
- "# use client to perform computations using execution graph\n",
- "sleep_futures = client.compute(sleep_operations, optimize_graph=False, fifo_timeout=\"0ms\")\n",
- "\n",
- "# collect and print results\n",
- "sleep_results = client.gather(sleep_futures)\n",
- "print(sleep_results)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using Dask, we see that this whole workflow takes a little over a second - each worker is truly executing in parallel!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Dask cuDF\n",
- "\n",
- "In the previous example, we saw how we can use Dask with very basic objects to compose a graph that can be executed in a distributed fashion. However, we aren't limited to basic data types.\n",
- "\n",
- "We can use Dask with objects such as Pandas DataFrames, NumPy arrays, and cuDF DataFrames to compose more complex workflows. With larger amounts of data and embarrassingly parallel algorithms, Dask allows us to scale ETL and Machine Learning workflows to gigabytes or terabytes of data. In the example below, we show how we can process 100 million rows by combining cuDF with Dask.\n",
- "\n",
- "Before we start working with cuDF DataFrames with Dask, we need to set up a Local CUDA Cluster and Client to work with our GPUs. This is very similar to how we set up a Local Cluster and Client in vanilla Dask."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import dask; print('Dask Version:', dask.__version__)\n",
- "from dask.distributed import Client\n",
- "# import dask_cuda; print('Dask CUDA Version:', dask_cuda.__version__)\n",
- "from dask_cuda import LocalCUDACluster\n",
- "\n",
- "\n",
- "# create a local CUDA cluster\n",
- "cluster = LocalCUDACluster()\n",
- "client = Client(cluster)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's inspect our `client` object:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "client"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As before, you can also see the status of the Client along with information on the Scheduler and Dashboard.\n",
- "\n",
- "With our client and workers set up, let's create our first distributed cuDF DataFrame using Dask. We'll instantiate our cuDF DataFrame in the same manner as in the previous sections, but with significantly more data. Lastly, we'll pass the cuDF DataFrame to `dask_cudf.from_cudf` and create an object of type `dask_cudf.core.DataFrame`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)\n",
- "\n",
- "\n",
- "# identify number of workers\n",
- "workers = client.has_what().keys()\n",
- "n_workers = len(workers)\n",
- "\n",
- "# create a cuDF DataFrame with two columns named \"key\" and \"value\"\n",
- "df = cudf.DataFrame()\n",
- "n_rows = 100000000 # let's process 100 million rows in a distributed parallel fashion\n",
- "df['key'] = np.random.binomial(1, 0.2, size=(n_rows))\n",
- "df['value'] = np.random.normal(size=(n_rows))\n",
- "\n",
- "# create a distributed cuDF DataFrame using Dask\n",
- "distributed_df = dask_cudf.from_cudf(df, npartitions=n_workers)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# inspect our distributed cuDF DataFrame using Dask\n",
- "print('-' * 15)\n",
- "print('Type of our Dask cuDF DataFrame:', type(distributed_df))\n",
- "print('-' * 15)\n",
- "print(distributed_df.head())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The above output shows the first several rows of our distributed cuDF DataFrame.\n",
- "\n",
- "With our Dask cuDF DataFrame defined, we can now perform the same `sum` operation as we did with our cuDF DataFrame. The key difference is that this operation is now distributed - meaning we can perform this operation using multiple GPUs or even multiple nodes, each of which may have multiple GPUs. This allows us to scale to larger and larger amounts of data!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "aggregation = distributed_df['value'].sum()\n",
- "print(aggregation.compute())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Conclusion\n",
- "\n",
- "In this notebook, we showed at a high level what each of the packages in RAPIDS is and what it does.\n",
- "\n",
- "To learn more about RAPIDS, be sure to check out: \n",
- "\n",
- "* [Open Source Website](http://rapids.ai)\n",
- "* [GitHub](https://github.com/rapidsai/)\n",
- "* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
- "* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
- "* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
- "* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/getting_started_notebooks/intro_tutorials/06_Introduction_to_Supervised_Learning.ipynb b/getting_started_notebooks/intro_tutorials/06_Introduction_to_Supervised_Learning.ipynb
deleted file mode 100644
index 5844e3f6..00000000
--- a/getting_started_notebooks/intro_tutorials/06_Introduction_to_Supervised_Learning.ipynb
+++ /dev/null
@@ -1,838 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Introduction to Supervised Learning\n",
- "#### By Paul Hendricks\n",
- "-------\n",
- "\n",
- "In this notebook, we will show how to do GPU accelerated Supervised Learning in RAPIDS. We will not cover SGD Regression at this time.\n",
- "\n",
- "**Table of Contents**\n",
- "\n",
- "* [Introduction to Supervised Learning](#introduction)\n",
- "* [Linear Regression](#linear)\n",
- "* [Ridge Regression](#ridge)\n",
- "* [K Nearest Neighbors](#knn)\n",
- "* [Setup](#setup)\n",
- "* [Conclusion](#conclusion)\n",
- "\n",
- "Before going any further, let's make sure we have access to `matplotlib`, a popular Python library for visualizing data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import subprocess\n",
- "\n",
- "try:\n",
- " import matplotlib\n",
- "except ModuleNotFoundError:\n",
- " os.system('conda install -y matplotlib')\n",
- " import matplotlib\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Setup\n",
- "\n",
- "This notebook was tested using the following Docker containers:\n",
- "\n",
- "* `rapidsai/rapidsai-dev-nightly:0.12-cuda10.1-runtime-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)\n",
- "\n",
- "This notebook was run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the below examples. \n",
- "\n",
- "If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues\n",
- "\n",
- "Before we begin, let's check out our hardware setup by running the `nvidia-smi` command."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvidia-smi"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's see what CUDA version we have:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!nvcc --version"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, let's load some helper functions from `matplotlib` and configure the Jupyter Notebook for visualization."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "from matplotlib.colors import ListedColormap\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "\n",
- "%matplotlib inline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Linear Regression\n",
- "\n",
- "After our data has been preprocessed, we often want to build a model so as to understand the relationships between different variables in our data. Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data. Below we show a simple example of how to create a Linear Regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "NumPy Version: 1.17.5\n"
- ]
- }
- ],
- "source": [
- "import numpy as np; print('NumPy Version:', np.__version__)\n",
- "\n",
- "\n",
- "# create the relationship: y = 2.0 * x + 1.0\n",
- "\n",
- "n_rows = 46000\n",
- "w = 2.0\n",
- "x = np.random.normal(loc=0, scale=1, size=(n_rows,))\n",
- "b = 1.0\n",
- "y = w * x + b\n",
- "\n",
- "# add a bit of noise\n",
- "noise = np.random.normal(loc=0, scale=2, size=(n_rows,))\n",
- "y_noisy = y + noise"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now visualize our data using the `matplotlib` library."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. cuML makes it easy to build machine learning models in an accelerated fashion while still using an interface nearly identical to Scikit-Learn. The example below shows how to build the same Linear Regression model on a GPU.\n",
- "\n",
- "First, let's convert our data from a NumPy representation to a cuDF representation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "cuDF Version: 0.12.0b+1877.g8b04eb7\n",
- " x y\n",
- "0 -0.445580 2.929034\n",
- "1 1.065418 -0.256664\n",
- "2 -1.133438 -1.950435\n",
- "3 1.977738 6.854074\n",
- "4 3.121144 6.280575\n"
- ]
- }
- ],
- "source": [
- "import cudf; print('cuDF Version:', cudf.__version__)\n",
- "\n",
- "\n",
- "# create a cuDF DataFrame\n",
- "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n",
- "print(df.head())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "cuML Version: 0.12.0a+752.g564916e\n"
- ]
- }
- ],
- "source": [
- "import cuml; print('cuML Version:', cuml.__version__)\n",
- "from cuml.linear_model import LinearRegression as LinearRegression_GPU\n",
- "\n",
- "\n",
- "# instantiate and fit model\n",
- "linear_regression_gpu = LinearRegression_GPU()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "CPU times: user 535 ms, sys: 186 ms, total: 721 ms\n",
- "Wall time: 717 ms\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "LinearRegression(algorithm='eig', fit_intercept=True, normalize=False, handle=)"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "%%time\n",
- "\n",
- "linear_regression_gpu.fit(df['x'], df['y'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [],
- "source": [
- "# create new data and perform inference\n",
- "new_data_df = cudf.DataFrame({'inputs': inputs})\n",
- "outputs_gpu = linear_regression_gpu.predict(new_data_df[['inputs']])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Lastly, we can overlay the relationship predicted by our GPU accelerated Linear Regression model (green line) on our empirical data points (light blue circles), the true relationship (black line), and the relationship predicted by the model built on the CPU (red line). We see that our GPU accelerated model's estimate of the true relationship (green line) is identical to the CPU based model's estimate (red line)!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
- "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='predicted relationship (gpu)')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Ridge Regression\n",
- "\n",
- "Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors and improve the conditioning of the problem.\n",
- "\n",
- "Below, we instantiate and fit a Ridge Regression model to our data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "from cuml.linear_model import Ridge as Ridge_GPU\n",
- "\n",
- "\n",
- "# instantiate and fit model\n",
- "ridge_regression_gpu = Ridge_GPU()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "CPU times: user 20.7 ms, sys: 20.5 ms, total: 41.2 ms\n",
- "Wall time: 40.4 ms\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "Ridge(alpha=1.0, solver='eig', fit_intercept=True, normalize=False, handle=)"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "%%time\n",
- "\n",
- "ridge_regression_gpu.fit(df[['x']], df['y'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Similar to the `LinearRegression` model we fitted earlier, we can use the `predict` method to generate predictions for new data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [],
- "source": [
- "outputs_gpu = ridge_regression_gpu.predict(new_data_df[['inputs']])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Lastly, we can visualize our `Ridge` model's estimated relationship and overlay it on the empirical data points."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAD4CAYAAADo30HgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAgAElEQVR4nO3dd3hT1RvA8e9JKVD2lk1BgQItLaVlWPasMlRABJRZNjJ+CjJUliwFGSobBFRUhgqFskdlyShQEJRNgQKyqWW3yfn90RBb6AKa3o738zx5kpx77rlv2iRv7nqv0lojhBBCPGYyOgAhhBApiyQGIYQQMUhiEEIIEYMkBiGEEDFIYhBCCBFDBqMDeFH58uXTzs7ORochhBCpyv79+69rrfPHNi3VJwZnZ2eCgoKMDkMIIVIVpdS5uKbJpiQhhBAxSGIQQggRgyQGIYQQMaT6fQyxiYiIIDQ0lAcPHhgdikgDMmfOTNGiRXF0dDQ6FCGSRZpMDKGhoWTPnh1nZ2eUUkaHI1IxrTU3btwgNDSUkiVLGh2OEMkiTW5KevDgAXnz5pWkIF6YUoq8efPK2qdIV9JkYgAkKYgkI+8lkd6k2cQghBBp1b2IewzeOJhzt+M8FeGFSGKwg9u3bzNjxgyjwyAkJARXV9cE+/z444+250FBQfTr1y9J43B2dub69etPtfv7+zNhwoQkXZYQad3Ws1txm+nGF7u+YM3JNXZZhiQGO4gvMZjN5iRdVmRk5AvN/2Ri8PLy4quvvnrRsBKlefPmDBkyJFmWJURqF/YgjO6rulPvu3qYlInAjoH08u5ll2VJYrCDIUOGcPr0aTw8PBg0aBCBgYHUrVuXdu3a4ebm9tQv+UmTJjFy5EgATp8+ja+vL5UrV6ZmzZocO3bsqfFHjhxJ9+7dadSoER06dMBsNjNo0CC8vb2pWLEis2fPfmqekJAQatasiaenJ56enuzatcsW6/bt2/Hw8GDKlCkEBgbStGlTAG7evMmbb75JxYoVqVatGocPH7Ytv0uXLtSpU4dSpUrZEsndu3dp0qQJ7u7uuLq6smTJEtvyv/76azw9PXFzc7O9poULF/L+++8D0KlTJ3r27EnNmjUpU6YMq1evftF/gxBphv9xf8rPKM/8g/P56NWPONzzMLWda9tteWnycNXoBgwYQHBwcJKO6eHhwdSpU+OcPmHCBI4cOWJbbmBgIHv37uXIkSOULFmSkJCQOOft3r07s2bNonTp0uzZs4fevXuzZcuWp/rt37+fHTt24OTkxJw5c8iZMyf79u3j4cOH+Pj40KhRoxg7TQsUKMDGjRvJnDkzJ0+epG3btgQFBTFhwgQmTZpk+yIODAy0zTNixAgqVarEihUr2LJlCx06dLC9pmPHjrF161bCw8MpW7YsvXr1Yt26dRQuXJiAgAAAwsLCbGPly5ePAwcOMGPGDCZNmsS8efOeek0hISH8/vvvnD59mrp163Lq1CkyZ84c599KiLTu6t2r9FvbjyVHl+BWwI2VbVbiVdjL7stNksSglPoWaApc1Vq7WtvyAEsAZyAEaK21vmWdNhTwA8xAP631emt7ZWAh4ASsAfrrNHJR6ipVqiR4HPydO3fYtWsXb7/9tq3t4cOHsfZt3rw5Tk5OAGzYsIHDhw+zfPlyIOoL+eTJk5QpU8bWPyIigvfff5/g4GAcHBw4ceJEgjHv2LGDX375BYB69epx48YN25d9kyZNyJQpE5kyZaJAgQJcuXIFNzc3Bg4cyODBg2natCk1a9a0jdWiRQsAKleuzK+//hrr8lq3bo3JZKJ06dKUKlWKY8eO4eHhkWCcQqQ1Wmt+/PNH+q/rT/ijcD6r+xkf+XxERoeMybL8pFpjWAh8A3wXrW0IsFlrPUEpNcT6fLBSqjzQBqgAFAY2KaXKaK3NwEygO7CbqMTgC6x9kcDi+2WfnLJmzWp7nCFDBiwWi+3542Pk
LRYLuXLlStQaTvTxtNZ8/fXXNG7cOEaf6GsmU6ZM4aWXXuLQoUNYLJZE/RKPLSc/XgvJlCmTrc3BwYHIyEjKlCnD/v37WbNmDUOHDqVRo0YMHz48Rv/HfWPz5GGhcpioSI8uhF2gV0AvAk4GUK1oNeY3n0/5/OWTNYYk2cegtd4G3Hyi+Q1gkfXxIuDNaO0/a60faq3PAqeAKkqpQkAOrfUf1rWE76LNk6pkz56d8PDwOKe/9NJLXL16lRs3bvDw4UPbZpwcOXJQsmRJli1bBkR9MR86dCjB5TVu3JiZM2cSEREBwIkTJ7h7926MPmFhYRQqVAiTycT3339v2wkeX6y1atVi8eLFQNQmpnz58pEjR44447h06RJZsmThvffeY+DAgRw4cCDB2KNbtmwZFouF06dPc+bMGcqWLftM8wuRmlm0hVlBs6gwowJbQ7YytfFUdnTekexJAey7j+ElrfVlAK31ZaVUAWt7EaLWCB4LtbZFWB8/2f4UpVR3otYsKF68eBKH/eLy5s2Lj48Prq6uvPbaazRp0iTGdEdHR4YPH07VqlUpWbIkLi4utmmLFy+mV69ejBkzhoiICNq0aYO7u3u8y+vatSshISF4enqitSZ//vysWLEiRp/evXvTsmVLli1bRt26dW1rHBUrViRDhgy4u7vTqVMnKlWqZJtn5MiRdO7cmYoVK5IlSxYWLVpEfP78808GDRqEyWTC0dGRmTNnJurv9VjZsmWpXbs2V65cYdasWbJ/QaQbJ2+cpOuqrmw7t40GpRowp+kcSuY2rgSLSqpN+EopZ2B1tH0Mt7XWuaJNv6W1zq2Umg78obX+wdo+n6jNRueB8VrrBtb2msBHWutm8S3Xy8tLP3mhnr///pty5colyesSyaNTp040bdqUVq1aGR1KrOQ9Jewh0hLJ5D8mMyJwBJkcMjG58WQ6e3ROls2oSqn9WutY92Tbc43hilKqkHVtoRBw1doeChSL1q8ocMnaXjSWdiGESHMO/XMIP38/9l/ez5subzL99ekUzl7Y6LAA+yYGf6AjMMF6vzJa+49KqclE7XwuDezVWpuVUuFKqWrAHqAD8LUd4xMpyMKFC40OQYhk8TDyIWO2jWHCzgnkccrD0lZLaVW+VYo62CKpDlf9CagD5FNKhQIjiEoIS5VSfkRtJnobQGt9VCm1FPgLiAT6WI9IAujFf4erruUFj0gSQoiU5I8Lf+Dn78ff1/+mg3sHJjeaTN4seY0O6ylJkhi01m3jmFQ/jv5jgbGxtAcB8Rf3EUKIVObuo7t8vOVjvtrzFcVyFmPtu2vxfcXX6LDilObPfBZCCCNtOrOJbqu6EXI7hD7efRhffzzZM2U3Oqx4SWIQQgg7uHX/FgM3DOTb4G8pk7cM2zpto2aJmgnPmAJIET07yZYtGxB10ldKPQQzub366qtJNtaAAQPYtm1bko0HMHDgwFjrUgnxrH77+zfKzyjPokOLGOIzhEM9D6WapACSGOyucOHCthpG9hJf6e0XLcsNSVcq/HFF1xd18+ZNdu/eTa1atZJkvMf69u0r14cQL+TKnSu0XtaaFktbUDBbQfZ228v4BuPJnCF1nawpicHOopfYXrhwIS1atMDX15fSpUvz0Ucf2fpt2LCB6tWr4+npydtvv82dO3cAGD16NN7e3ri6utK9e3db/aI6deowbNgwateuzbRp02IsM7FluS0WC71796ZChQo0bdqU119/3ZbEnJ2dGT16NDVq1GDZsmVxlgNftmwZrq6uuLu7276ojx49SpUqVfDw8KBixYqcPHkS+G8tSmvNoEGDcHV1xc3NzVaeOzAwkDp16tCqVStcXFx49913Y63XtHz5cnx9/9txt2/fPl599VXc3d2pUqUK4eHhLFy4kDfeeANfX1/Kli3LqFGjnvp/QMyS5yVKlODGjRv8888/z/ZPFume1prvDn1HuenlWHl8JWPrjWVv1714FvI0OrTnkvb3MQwYAElcdhsPD3jO
4nzBwcEcPHiQTJkyUbZsWfr27YuTkxNjxoxh06ZNZM2alc8//5zJkyczfPhw3n//fVshuvbt27N69WqaNYs6Gfz27dv8/vvvsS4nMWW59+/fT0hICH/++SdXr16lXLlydOnSxTZG5syZ2bFjBwD169ePtRz46NGjWb9+PUWKFOH27dsAzJo1i/79+/Puu+/y6NGjp9Y4fv31V4KDgzl06BDXr1/H29vbllQOHjzI0aNHKVy4MD4+PuzcuZMaNWrEmH/nzp22zXOPHj3inXfeYcmSJXh7e/Pvv//aqs4+LnWeJUsWvL29adKkCfny5Yv3/+Pp6cnOnTtp2bJlwv9MIYDzYefpsboH606t49VirzK/+Xxc8rkkPGMKlvYTQwpTv359cubMCUD58uU5d+4ct2/f5q+//sLHxweI+rKrXr06AFu3buWLL77g3r173Lx5kwoVKtgSwzvvvBPnchJTlnvHjh28/fbbmEwmChYsSN26dWOM8Xj8+MqB+/j40KlTJ1q3bm0rrV29enXGjh1LaGgoLVq0oHTp0jHG3bFjB23btsXBwYGXXnqJ2rVrs2/fPnLkyEGVKlUoWjTqBHgPDw9CQkKeSgyXL18mf/78ABw/fpxChQrh7e0NEKPIX8OGDcmbN+oY8RYtWrBjxw7efDP+uowFChTg0iU54V4kzKItzNw3kyGbh0RVOH7ta3p798akUv+GmLSfGFJI2e3HYitXrbWmYcOG/PTTTzH6PnjwgN69exMUFESxYsUYOXKkrUQ3xCy9/aTElOV+fEGdhMaIrxz4rFmz2LNnDwEBAXh4eBAcHEy7du2oWrUqAQEBNG7cmHnz5lGvXr0Y8cQltr/Pk5ycnGx/B611nGeMxlbGO66S59GfP06oQsTl+PXjdF3VlR3nd9Do5UbMbjob51zORoeVZFJ/aksDqlWrxs6dOzl16hQA9+7d48SJE7YvrXz58nHnzp3n3okdV1nuGjVq8Msvv2CxWLhy5UqMq7dFF1858NOnT1O1alVGjx5Nvnz5uHDhAmfOnKFUqVL069eP5s2b2y4J+litWrVYsmQJZrOZa9eusW3bNqpUqZLo11OuXDnb38rFxYVLly6xb98+AMLDw23JZOPGjdy8eZP79++zYsUKfHx84ix5/tiJEydi7IMQIroIcwQTdkzAfZY7R68eZeEbC1n37ro0lRQgPawxpAL58+dn4cKFtG3b1raJZsyYMZQpU4Zu3brh5uaGs7OzbXPJs4qrLHfLli3ZvHkzrq6ulClThqpVq9o2cz0prnLggwYN4uTJk2itqV+/Pu7u7kyYMIEffvgBR0dHChYsaNtH8thbb73FH3/8gbu7O0opvvjiCwoWLBjr9a1j06RJE2bPnk3Xrl3JmDEjS5YsoW/fvty/fx8nJyc2bdoEQI0aNWjfvj2nTp2iXbt2eHlFFZKMq+R5REQEp06dsvUTIrqDlw/i5+/HwX8O0rJcS755/RsKZitodFj2obVO1bfKlSvrJ/31119PtYnYhYeHa621vn79ui5VqpS+fPmywREljo+Pj75161ac0xcsWKD79OnzTGP++uuv+pNPPol1mryn0q/7Eff1sE3DtMMoB/3SxJf08qPLjQ4pSQBBOo7vVVljSOeaNm3K7du3efToEZ9++ikFC6aOX0Bffvkl58+fJ1euXAl3TqTIyEg+/PDDJBtPpH47z+/Ez9+P4zeO09mjM5MaTSKPUx6jw7K7JLtQj1HkQj0iOch7Kn0JfxjOsM3DmL5vOsVzFmdOszk0ermR0WElKaMu1COEEKnO+lPr6b66OxfCLtC3Sl/G1h9LtozZjA4rWUliEEII4Ob9m3yw/gMWHVqESz4Xtnfejk9xH6PDMoQkBiFEuvfLX7/QZ00frt+7zsc1P+aTWp+kuvpGSUkSgxAi3bocfpn3177Pr3//imchT9a9tw6Pgh5Gh2U4OcEtmbz++uu2WkLRjRw5kkmTJhkQUfIYPny47byCF3Xw4EG6
du2aJGNFt3r1akaMGJHk44qUS2vNwuCFlJ9RnoATAUyoP4E9XfdIUrCSxGBnWmssFgtr1qxJ0kMrE7PMF5EU5bohqjpsgwYNkmSscePG0bdv3yQZK7omTZrg7+/PvXv3knxskfKE3A6h8Q+N6byyM24F3Djc6zCDawwmg0k2oDwmicEOQkJCKFeuHL1798bT05MLFy7g7OzM9evXARg7dixly5alQYMGHD9+3Dbfvn37qFixItWrV7eVpQbiLJud0DLjKuW9Zs0aXFxcqFGjBv369aNp06ZA4st1X758mVq1auHh4YGrqyvbt2/HbDbTqVMnWyntKVOmANCpUydbKY/NmzdTqVIl3Nzc6NKli+0sb2dnZ0aMGIGnpydubm6xngEdHh7O4cOHcXd3B+DatWs0bNgQT09PevToQYkSJbh+/TohISG4uLjQsWNHKlasSKtWrWxf+NH/B0FBQdSpUweIqqFUp06dp8pjiLTFbDHz1Z6vcJ3hyh+hfzD99ekEdgqkTN4yRoeW4qT5FDlg3QCC/0nastseBT2Y6ht/cb7jx4+zYMECZsyYEaN9//79/Pzzzxw8eJDIyEg8PT2pXLkyAJ07d2bOnDm8+uqrDBkyxDbP/PnzYy2bXbJkyTiXef369VhLeX/00Uf06NGDbdu2UbJkSdq2bftUfAmV6/71119p3LgxH3/8MWazmXv37hEcHMzFixc5cuQIwFObzR48eECnTp3YvHkzZcqUoUOHDsycOZMBAwYAUfWgDhw4wIwZM5g0aRLz5s2LMX9QUFCMGkajRo2iXr16DB06lHXr1jFnzpwYf4f58+fj4+NDly5dmDFjBgMHDoz3/+Xl5cX27dtp3bp1vP1E6vT3tb/puqoruy7swvcVX2Y3nU3xnMWNDivFkjUGOylRogTVqlV7qn379u289dZbZMmShRw5ctC8eXMg6os0PDzcdvnLdu3a2ebZsGED3333HR4eHlStWpUbN27YLn4T1zJ3795tK+Xt4eHBokWLOHfuHMeOHaNUqVK2pPJkYniyXHdsy/X29mbBggWMHDmSP//8k+zZs1OqVCnOnDlD3759WbduXYzy1xD1ZV2yZEnKlIn6ddaxY8cYl+Z8XLK7cuXKhISEPPXaopfahqjS3W3atAHA19eX3Llz26YVK1bMVsL8vffes11TIj5SbjttijBHMHbbWDxme3Ds+jG+e/M71rRbI0khAWl+jSGhX/b2El9J7NjKRMd3BrqOo2x2fMvUcZTyPnjw4DONEddyt23bRkBAAO3bt2fQoEF06NCBQ4cOsX79eqZPn87SpUv59ttvE/X64L9y24kptZ3QeLGV2wZilNyWcttp3/5L+/Hz9+PQlUO0rtCar3y/4qVsLxkdVqogawzJrFatWvz222/cv3+f8PBwVq1aBUDu3LnJnj07u3fvBuDnn3+2zRNX2ez4xFXK28XFhTNnzth+lT++rGZs4lruuXPnKFCgAN26dcPPz48DBw5w/fp1LBYLLVu25LPPPuPAgQMxxnJxcSEkJMQWz/fff0/t2rUT+2eLUWoboiqnLl26FIhas7l165Zt2vnz5/njjz8A+Omnn2wX+nF2dmb//v0A/PLLLzHGl3Lbacf9iPsM2TSEqvOqcvXuVX575zeWtFoiSeEZpPk1hpTG09OTd955Bw8PD0qUKEHNmjVt0+bPn0+3bt3ImjUrderUsZXAjqtsdnziK+U9Y8YMfH19yZcvX7zXQYhruYGBgUycOBFHR0eyZcvGd999x8WLF+ncubPtF/n48eNjjJU5c2YWLFjA22+/TWRkJN7e3vTs2TPRfzcXFxfCwsIIDw8ne/bsjBgxgrZt27JkyRJq165NoUKFyJ49O3fu3KFcuXIsWrSIHj16ULp0aXr16gXAiBEj8PPzY9y4cVStWjXG+Fu3bn0qZpH6bDu3ja7+XTl58yR+lfyY1GgSuTInz9GAaUpcZVdTyy0tld1+XAJba63Hjx+v+/XrZ9fl
WCwW3atXLz158mS7LCepTZ48Wc+dO1drrfWDBw90RESE1lrrXbt2aXd3d6211mfPntUVKlR4pnH/+ecfXa9evXj7pNb3VHoR9iBM917dWzMSXXJqSb3p9CajQ0rxkLLbqUNAQADjx48nMjKSEiVKsHDhQrssZ+7cuSxatIhHjx5RqVIlevToYZflJLVevXrZriJ3/vx5WrdujcViIWPGjMydO/e5xz1//jxffvllUoUpktnak2vpsboHof+GMqDqAMbUG0PWjHHv4xMJk7LbQiSCvKdSnhv3bvC/9f/j+8PfUz5/eeY3n0+1ok8fCShiJ2W3hRBphtaaZX8t4/0173PrwS2G1xrOsJrDyJQhk9GhpRmSGIQQqcal8Ev0DujNyuMr8Srsxabmm6j4UkWjw0pzJDEIIVI8rTXfHvyWDzd8yEPzQyY2nMiAagOkvpGdyF9VCJGinbl1hm6rurHl7BZql6jNvObzeCXPK0aHlabJCW6pzOOSGc86LT6JLf2dLVv8lze8ffv2U7Wh7CExpbwDAwPZtWuX3WMR9mO2mJm6eypuM93Yd3Efs5rMYkvHLZIUkoGsMaQysX3Zmc1mHBwcDP8ifJwYevfubdfljB49OsE+gYGBZMuW7bmTpTDW0atH8fP3Y8/FPTQp3YRZTWdRNEdRo8NKN2SNAVhx8CI+E7ZQckgAPhO2sOLgxRce84cffqBKlSp4eHjQo0cPzGYzEPWre/DgwVSuXJkGDRqwd+9e6tSpQ6lSpfD39wdg4cKFvPHGG/j6+lK2bFlGjRplG/fxr/bAwEDq1q1Lu3btcHNzizEN4IsvvsDNzQ13d3dbpda5c+fi7e2Nu7s7LVu2TPD6A2fPnqV69ep4e3vz6aef2trv3LlD/fr1bWWyV65cCcCQIUM4ffo0Hh4eDBo0KM5+T8qWLRsffvghnp6e1K9fn2vXrgEQHBxMtWrVqFixIm+99Zat7EX0Ut6xlewOCQlh1qxZTJkyBQ8PD7Zv386yZctwdXXF3d2dWrVqJeZfKAzwyPyI0b+PptLsSpy6eYrFLRazqu0qSQrJLa4z31LL7UXPfP7tQKh2+WStLjF4te3m8sla/duB0ESPEdvymzZtqh89eqS11rpXr1560aJFWmutAb1mzRqttdZvvvmmbtiwoX706JEODg62nb27YMECXbBgQX39+nV97949XaFCBb1v3z6ttdZZs2bVWmu9detWnSVLFn3mzBnbch9PW7Nmja5evbq+e/eu1lrrGzduaK21vn79uq3vxx9/rL/66iuttdYjRozQEydOfOp1NGvWzBb3N998Yxs/IiJCh4WFaa21vnbtmn755Ze1xWJ56qzjuPo9CdA//PCD1lrrUaNG6T59+mittXZzc9OBgYFaa60//fRT3b9/f6211h07dtTLli3TWmtdokQJ2+uYPn269vPzi/U1ubq66tDQqP/prVu3noohIXLms/3tDd2r3Wa4aUai2y5vq6/euWp0SGka8Zz5nO7XGCauP879CHOMtvsRZiauPx7HHAnbvHkz+/fvx9vbGw8PDzZv3syZM2cAyJgxI76+vgC4ublRu3ZtHB0dcXNzi1FuumHDhuTNmxcnJydatGgRa+noKlWqPHVNBoBNmzbRuXNnsmTJAkCePHkAOHLkCDVr1sTNzY3Fixdz9OjReF/Hzp07bWW527dvb2vXWjNs2DAqVqxIgwYNuHjxIleuXHlq/sT2M5lMvPPOO8B/ZbLDwsK4ffu2rdDek2W6o0uoZDeAj48PnTp1Yu7cuba1N5Ey3Iu4x6ANg6g2vxo379/Ev40/P7b8kfxZ8yc8s7ALu+9jUEqFAOGAGYjUWnsppfIASwBnIARorbW+Ze0/FPCz9u+ntV5vz/gu3b7/TO2JobWmY8eOsRZlc3R0tJWBNplMtnLTJpMpRrnpuEpHRxdXaW+tdaz9O3XqxIoVK3B3d2fhwoUEBgYm+FpiG2fx4sVcu3aN/fv34+joiLOz81Nl
rJ+lX2KWGZ+ESnYDzJo1iz179hAQEICHhwfBwcHkzZv3mZYjkl5gSCDdVnXj1M1TdPfszhcNvyBn5pxGh5XuJdcaQ12ttYf+7/TrIcBmrXVpYLP1OUqp8kAboALgC8xQSjnYM7DCuWKvwR9Xe2LUr1+f5cuXc/XqVQBu3rzJuXPnnmmMjRs3cvPmTe7fv8+KFStsF55JjEaNGvHtt9/a9iHcvHkTiLo8ZqFChYiIiGDx4sUJjuPj42Mr/x29f1hYGAUKFMDR0ZGtW7faXlv27NkJDw9PsN+TLBaLbZ/Bjz/+SI0aNciZMye5c+dm+/btwLOX6X4yltOnT1O1alVGjx5Nvnz5uHDhQqLHEkkv7EEYPVf3pO6iumit2dJhC7ObzZakkEIYdVTSG0Ad6+NFQCAw2Nr+s9b6IXBWKXUKqAL8Ya9ABjUuy9Bf/4yxOcnJ0YFBjcs+95jly5dnzJgxNGrUCIvFgqOjI9OnT6dEiRKJHqNGjRq0b9+eU6dO0a5dO7y8Yi1pEitfX1+Cg4Px8vIiY8aMvP7664wbN47PPvuMqlWrUqJECdzc3GJ8ccZm2rRptGvXjmnTptGyZUtb+7vvvkuzZs3w8vLCw8MDFxcXAPLmzYuPjw+urq689tprDB48ONZ+T8qaNStHjx6lcuXK5MyZ03aNiEWLFtGzZ0/u3btHqVKlWLBgQaL/Bs2aNaNVq1asXLmSr7/+milTpnDy5Em01tSvX9927WiR/FafWE3P1T25fOcyH1b/kNF1R5PFMYvRYYlo7F5ETyl1FrgFaGC21nqOUuq21jpXtD63tNa5lVLfALu11j9Y2+cDa7XWy58YszvQHaB48eKVn/wl+qwFz1YcvMjE9ce5dPs+hXM5MahxWd6sVOT5XnASWLhwIUFBQXzzzTeGxZCcsmXLxp07d4wOI15SRO/FXbt7jf7r+vPTkZ9wLeDK/ObzqVIk7uuBCPsyuoiej9b6klKqALBRKXUsnr6xbVx+KnNprecAcyCquuqLBvhmpSKGJgIh0jKtNT8f+Zl+6/oR9iCMUXVGMaTGEDI6ZDQ6NBEHuycGrfUl6/1VpdRvRG0auqKUKqS1vqyUKgRctXYPBYpFm70okO6u0N6pUyc6depkdBDwGMkAAByHSURBVBjJJqWvLYjnF/pvKL0CerH6xGqqFKnC/ObzcS0gl1BN6ey681kplVUplf3xY6ARcATwBzpau3UEHp/55A+0UUplUkqVBEoDe59n2fbeRCbSD3kvPTuLtjBn/xwqzKjA5jObmdxoMru67JKkkErYe43hJeA36+GHGYAftdbrlFL7gKVKKT/gPPA2gNb6qFJqKfAXEAn00Vo/80HnmTNn5saNG+TNm/eZD30UIjqtNTdu3CBz5sxGh5JqnLp5im6ruhEYEkhd57rMbTaXl/O8bHRY4hmkySu4RUREEBoamqhj5oVISObMmSlatCiOjo5Gh5KiRVoimbZ7Gp9u/RRHB0e+bPQlfpX85MdZCmX0zudk5+joGOsZwUII+/jzyp/4+fux79I+mpdtzozXZ1AkhxzQkVqlycQghEgeDyMfMm77OMbtGEfuzLn5ueXPtK7QWtYSUjlJDEKI57IndA9+/n4cvXaU9yq+x5TGU8iXJZ/RYYkkIIlBCPFM7j66y6dbP2Xq7qkUyVGE1W1X06RME6PDEklIEoMQItG2nN1Ct1XdOHPrDL28ejGhwQRyZMphdFgiiUliEEIk6PaD2wzaMIh5B+dROk9pAjsGUts58UUNReoiiUEIEa+Vx1bSK6AXV+5e4aNXP2JknZE4OT5/9WGR8kliEELE6urdq/Rb248lR5dQ8aWK+Lf1x6tw4qv8itRLEoMQIgatNYv/XEz/df258+gOn9X9jME+g3F0kBP80gtJDEIImwthF+gZ0JM1J9dQrWg15jefT/n85Y0OSyQzSQxCCCzawuyg2QzeNBizNjO18VTer/I+Dia7XkBRpFCSGIRI507cOEFX/65s
P7+dBqUaMKfpHErmlpIy6ZkkBiHSqUhLJJP/mMyIwBFkzpCZb5t/SyePTlLOQkhiECI9OvTPIbr4d+HA5QO85fIW01+fTqHshYwOS6QQkhiESEceRj5kzLYxTNg5gTxOeVj29jJalmspawkiBkkMQqQTuy7soqt/V/6+/jcd3DswudFk8mbJa3RYIgWSxCBEGnfn0R0+3vwxX+/9mmI5i7H23bX4vuJrdFgiBZPEIEQatvH0Rrqv7k7I7RDe936fcfXHkT1TdqPDEimcJAYh0qBb92/x4YYPWRC8gLJ5y7K983ZqFK9hdFgilZDEIEQa89vfv9F7TW+u3b3G0BpDGV57OJkzZDY6LJGKSGIQIo34584/9F3bl+V/LcejoAcB7QLwLORpdFgiFZLEIEQqp7Xmu0Pf8b/1/+NexD3G1RvHwFcHStE78dwkMQiRip27fY4eq3uw/vR6fIr5MK/5PFzyuRgdlkjlJDEIkQpZtIUZ+2YwZNMQAL5+7Wt6e/fGpEwGRybSAkkMQqQyx68fx8/fj50XdtL45cbMbjqbErlKGB2WSEMkMQiRSkSYI5i0axKjfh9FFscsLHxjIR3cO0g5C5HkJDEIkQocvHyQLv5dCP4nmFblW/H1a19TMFtBo8MSaZQkBiFSsAeRDxgVOIqJuyaSL0s+fmn9Cy3KtTA6LJHGSWIQIoXacX4Hfv5+nLhxgs4enfmy0ZfkdsptdFgiHZDEIEQKE/4wnKGbhzJ933Scczmz4b0NNHy5odFhiXREEoMQKcj6U+vpvro7F8Iu0K9KP8bWH0u2jNmMDkukM5IYhEgBbt6/yf/W/4/vDn2HSz4XdnTZwavFXjU6LJFOSWIQwkBaa375+xf6rOnDzfs3+bjmx3xS6xMpeicMJYlBCINcDr9MnzV9+O3Yb3gW8mT9e+vxKOhhdFhCSGIQIrlprVkYvJAPNnzAg8gHfN7gcz6o/gEZTPJxFCmDvBOFSEZnb52l++rubDqziZrFazKv+TzK5C1jdFhCxCCJQYgnrDh4kYnrj3Pp9n1yZXHkYYSZexEWAHI5OTKyeQXerFTE1v+TFX+yeM95tI57TI2ZcIcAbjsuAkzkiejNueO+NJp4Ejhp3xckEpTLyRGl4Na9CBSgn5gW/X/+5PtDawi7H0HhXE7UdcnP1mPXuHT7PoVzOTGocdkY75XUQun43s0GUEr5AtMAB2Ce1npCfP29vLx0UFBQssSWUjx+Y168fT/W6VkzOuDoYOL2/aff5CL5RagL3HCcxkOHY2Q2VyZvRB8y6AJGhyWegaNJMfFtd4LO3WTx7vOJ/kw5OTowvoVbikwOSqn9Wmuv2KalqDUGpZQDMB1oCIQC+5RS/lrrv4yNLPGif2k7KIVZa4pE+yXxZPugxmWferNlzejA2Lei3kyfrPiTH/ecx/IM3+53H5kBMyBJwUiaSMIyLCcsw8+YcCLvow/Jaq6DIvUVvdNag8WMyRwB5kiUxYwp8hFYzChzJFgiMEVGgjajIiNQ1naTJQIsFlRkBGgzJnMkWCJRZjPKEomyjqUsZkyWSNt4SltiTIu6RYLZjElbnmi33rQZk9mC0maUxRI1vrZYH/83X8a8xTBlyIhJW8BiwaSjbljvlUVHzad11HPr/caZGm2x0MTWV//XBx01HxrHrLnIXr4OJgdHTNrClb3LoGZJ0Boslv/uoz9O6D6uae+9B3XqJPn/O0WtMSilqgMjtdaNrc+HAmitx8c1T0paY1hx8CJDf/2T+xHmRM/jYFKYn/zW1xoHZaFKsezsOXkt6g1vMUd9wCyRmMzmqA+ZxRz1YbXeK7M5Rl9T9A9MtPmVJfqHJ9qHRlv7my0obcGkreM87qstmCwW67It1g9P1Bj/fcjMKK3/a9PmqA+Qrb/+70MYo83834fMolFEtUfdHn94/3tuGweNyfqBfNwHrTFh/dACCmK9j2/as/SJre+ZQjD9DQgpCDWPQI+1kOfu84+X1PE9Tx+RhJSK
uplMCd/HN238eGjf/jlDSCVrDEAR4EK056FA1Sc7KaW6A90Bihcv/nxL+vFHmD792bJzAn1f/fcBgRYL+uFdVMTD5/rwOjzfqxHxiEofMe9ja3uePk/2vZcBZtSBRa9GJYIvf4LaxxMez5xM8SVV32QZz2RCKxNaOfz32OSAxXr/+Lk2OYDJhEU5oB2s96bHNxPalMHaxwGLyYRDrkKQ0ck6ZtR4mBywODjEXN7jcU0OWFQG2+P/+jrE6BM1vgM6Q2ZMTtmxWL/8C+bKwu+D6/33hf74loKltMQQ21/rqVUarfUcYA5ErTE815IyZIAsWZ4tOyfQd+v+i1iAiH+vci8k+MU/HEqhUdY3q7I9t5j+a7eoqGVbovexPY5KORbTf88fT9PKhMVkAusYUW/iqDe4to1hsn54lPXDZoo5n8nBem+K0d/2wXw8v/UD83hZjz+wtjaHDBC9PcaH0PrBNGWAGB/2x8vIgHb478OvTSa0g2NUnA4ZUKbkS7UPTEe44fgVkaZLZItshFOGLnz1Zja+SrYIhL04mhRZM2Xg9v2IZ5rPydGB/zVxBcfUdf3tlJYYQoFi0Z4XBS7ZZUmtW0fdktBXE7bEuUNYJL/k+k1m4R63HBdyJ8MaMlheosDDMThZ5ES11Cquo5KAWDcVZ3E0cT/CkqaOSkppiWEfUFopVRK4CLQB2hkbUuINalw2afYxWNurlczNztM3kzJEkcTum/Zxw3E6ZnWD7JFvkCuiPSaknEVqULpAVu49sjzzl/jjQ1VT8xd/QlJUYtBaRyql3gfWE7W5/Vut9VGDw0q0x28Qo49KEvZnJoxbjnO5myEQR0tx8j8cQibtkiRj58jkwL8P4/5x4aDAbH0/RD/GPvrx9U9+acU3TSTem5WKpIu/W4o6Kul5pKSjkkTap7Vm6dGl9F3bl1sPbjGsxjCG1RxGpgyZjA5NiGeSmo5KEiLFuhR+iV4BvfA/7o9XYS82N9+M20tuRoclRJKTxCBEArTWzD84n4EbBvLQ/JBJDSfRv1p/KXon0ix5ZwsRjzO3ztBtVTe2nN1C7RK1mdd8Hq/kecXosISwK0kMQsTCbDHz1Z6v+HjLx2QwZWB209l09eyKSck5wCLtk8QgxBOOXD2Cn78fey/upUnpJsxqOouiOYoaHZYQyUYSgxBWj8yPGL99PGO3jyVn5pz82OJH2ri2QaXw8gVCJDVJDEIA+y7uo4t/F45cPUI7t3ZMbTyV/FnzGx2WEIaQxCDStXsR9xi+dThTdk+hULZC+Lfxp1nZZkaHJYShJDGIdGvr2a10W9WN07dO06NyDz5v8Dk5M+c0OiwhDCeJQaQ7YQ/C+GjjR8w5MIeXc7/Mlg5bqFuyrtFhCZFiSGIQ6cqq46voGdCTf+78w8DqAxlVdxRZHLMYHZYQKYokBpEuXLt7jf7r+vPTkZ9wK+DGindW4F3E2+iwhEiRJDGINE1rzU9HfqLf2n78+/BfRtUZxZAaQ8jokNHo0IRIsSQxiDQr9N9QegX0YvWJ1VQtUpX5zedToUAFo8MSIsWTxCDSHIu2MHf/XAZtHESkJZLJjSbTr2o/HJLxMp9CpGaSGESacvLGSbqt6sbv536nXsl6zG02l1K5SxkdlhCpiiQGkSZEWiKZunsqn279lEwOmZjXbB5dKnWRchZCPAdJDCLVO3zlMH7+fgRdCuKNsm8wo8kMCmcvbHRYQqRakhhEqvUw8iHjto9j3I5x5M6cmyWtlvB2+bdlLUGIFySJQaRKu0N34+fvx1/X/uK9iu8xtfFU8mbJa3RYQqQJkhhEqnL30V0+2fIJ0/ZMo0iOIgS0C+D10q8bHZYQaYokBpFqbD6zmW6runH29ll6efViQoMJ5MiUw+iwhEhzJDGIFO/2g9sM3DCQ+QfnUzpPaX7v9Du1StQyOiwh0ixJDCJFW3lsJb0CenH17lUG+wxmRO0RODk6GR2WEGmaJAaRIl25c4V+6/qx9OhS3F9yZ1XbVVQu
XNnosIRIFyQxiBRFa80Ph39gwPoB3Hl0hzF1x/CRz0c4OjgaHZoQ6YYkBpFinA87T8/VPVl7ai3Vi1ZnfvP5lMtfzuiwhEh3JDEIw1m0hVlBsxi8aTAWbWGa7zT6ePeRondCGEQSgzDUiRsn6Orfle3nt9OwVENmN51NydwljQ5LiHRNEoMwRKQlki93fcmIwKijjBa8sYCO7h2lnIUQKYAkBpHsDv1ziC7+XThw+QBvubzF9NenUyh7IaPDEkJYSWIQyeZB5APGbBvD5zs/J69TXpa/vZyW5VsaHZYQ4gmSGESy2HVhF37+fhy7foyO7h2Z3HgyeZzyGB2WECIWkhiEXd15dIdhm4fxzd5vKJazGOveXUfjVxobHZYQIh6SGITdbDi9ge6runM+7Dx9vPswrv44smfKbnRYQogESGIQSe7W/Vt8sOEDFgYvpGzesmzrvI0axWsYHZYQIpEkMYgk9evfv9JnTR+u3b3G0BpDGV57OJkzZDY6LCHEMzDZa2Cl1Eil1EWlVLD19nq0aUOVUqeUUseVUo2jtVdWSv1pnfaVkoPaU41/7vxDq6WtaLm0JQWzFWRft32Mqz9OkoIQqZC91ximaK0nRW9QSpUH2gAVgMLAJqVUGa21GZgJdAd2A2sAX2CtnWMUL0BrzaJDi/hg/Qfci7jHuHrjGPjqQCl6J0QqZsSmpDeAn7XWD4GzSqlTQBWlVAiQQ2v9B4BS6jvgTSQxpFght0PosboHG05vwKeYD/Oaz8Mln4vRYQkhXpDdNiVZva+UOqyU+lYpldvaVgS4EK1PqLWtiPXxk+1PUUp1V0oFKaWCrl27Zo+4RTws2sLXe77GdYYruy7s4pvXvmFb522SFIRII15ojUEptQkoGMukj4naLPQZoK33XwJdgNj2G+h42p9u1HoOMAfAy8sr1j7CPo5dP0ZX/67svLCTxi83ZnbT2ZTIVcLosIQQSeiFEoPWukFi+iml5gKrrU9DgWLRJhcFLlnbi8bSLlKACHMEE3dNZNTvo8jqmJVFby6ifcX2UvROiDTInkclRa+K9hZwxPrYH2ijlMqklCoJlAb2aq0vA+FKqWrWo5E6ACvtFZ9IvAOXD1BlXhU+3vIxzcs25+8+f9PBvYMkBSHSKHvufP5CKeVB1OagEKAHgNb6qFJqKfAXEAn0sR6RBNALWAg4EbXTWXY8G+h+xH1G/z6aibsmkj9rfn5p/QstyrUwOiwhhJ0prVP3JnovLy8dFBRkdBhpzo7zO/Dz9+PEjRN08ejCpEaTyO2UO+EZhRCpglJqv9baK7ZpcuaziCH8YThDNw9l+r7pOOdyZmP7jTQolahdSUKINEISg7BZe3ItPVb3IPTfUPpX7c+YemPIljGb0WEJIZKZJAbBjXs3+N/6//H94e8pl68cO7vspHqx6kaHJYQwiCSGdExrzfK/lvP+2ve5ef8mn9T8hE9qfUKmDJmMDk0IYSBJDOnU5fDL9F7TmxXHVlC5UGU2vLcB94LuRoclhEgBJDGkM1prFgQv4IP1H/DQ/JAvGnzB/6r/jwwmeSsIIaLIt0E6cvbWWbqv7s6mM5uoVaIWc5vNpUzeMkaHJYRIYSQxpANmi5lv9n7DsC3DcFAOzGwyk+6Vu2NS9q6hKIRIjSQxpHF/XfsLP38/dofu5rVXXmN209kUy1ks4RmFEOmWJIY06pH5EZ/v+Jwx28eQPWN2fnjrB9q5tZP6RkKIBEliSIOCLgXh5+/H4SuHaePahmm+0yiQtYDRYQkhUglJDGnI/Yj7jAgcwZd/fEnBbAVZ2WYlzcs2NzosIUQqI4khjfg95He6rurKqZun6ObZjS8afkGuzLmMDksIkQpJYkjl/n34L4M3DmbW/lmUyl2KzR02U69kPaPDEkKkYpIYUrGAEwH0DOjJpfBLfFDtA0bXHU3WjFmNDksIkcpJYkiFrt+7zoB1A1j852LK5y/P8reXU7VoVaPDEkKkEZIYUhGtNUuOLqHv2r6E
PQhjRO0RDK0xVIreCSGSlCSGVOLivxfpvaY3/sf98S7szfzm83F7yc3osIQQaZAkhhROa828A/MYuHEgEeYIJjWcxIBqA3AwORgdmhAijZLEkIKdvnmabqu6sTVkK3Wc6zC32VxeyfOK0WEJIdI4SQwpkNliZtqeaXyy5RMcHRyZ3XQ2XT27StE7IUSykMSQwhy5egQ/fz/2XtxL0zJNmdlkJkVzFDU6LCFEOiKJIYV4ZH7E+O3jGbt9LDkz5+Snlj/xToV3pOidECLZSWJIAfZe3Iufvx9Hrh6hnVs7pvlOI1+WfEaHJYRIpyQxGOhexD0+3fIpU/dMpVC2Qqxqu4qmZZoaHZYQIp2TxGCQrWe30nVVV87cOkOPyj34vMHn5Myc0+iwhBBCEkNyC3sQxqCNg5h7YC4v536ZrR2jDkUVQoiUQhJDMlp1fBU9A3ryz51/GFh9IKPqjiKLYxajwxJCiBgkMSSDa3ev0W9dP34+8jNuBdxY8c4KvIt4Gx2WEELEShKDHWmt+fHPH+m/rj//PvyX0XVGM7jGYDI6ZDQ6NCGEiJMkBju5EHaBXgG9CDgZQNUiVZnffD4VClQwOiwhhEiQJIYkZtEW5uyfw0cbP8KszUxpPIW+VfpK0TshRKohiSEJnbxxkm6ruvH7ud+pX7I+c5rNoVTuUkaHJYQQz0QSQxKItEQy5Y8pDA8cTiaHTMxrNo8ulbpIOQshRKokieEFHb5yGD9/P4IuBfFG2TeY0WQGhbMXNjosIYR4bpIYntPDyIeM3T6W8TvGk8cpD0tbLaVV+VayliCESPUkMTyHPy78gZ+/H39f/5v2FdszpfEU8mbJa3RYQgiRJF7oyi9KqbeVUkeVUhallNcT04YqpU4ppY4rpRpHa6+slPrTOu0rZf2JrZTKpJRaYm3fo5RyfpHY7OHuo7sMWDcAn299uPPoDmvareG7t76TpCCESFNe9JJgR4AWwLbojUqp8kAboALgC8xQSj0+XnMm0B0obb35Wtv9gFta61eAKcDnLxhbktp0ZhOuM12Ztmcavbx6caT3EV4r/ZrRYQkhRJJ7ocSgtf5ba308lklvAD9rrR9qrc8Cp4AqSqlCQA6t9R9aaw18B7wZbZ5F1sfLgfoqBWywv/3gNn4r/Wj4fUMcTY783ul3pjeZTo5MOYwOTQgh7MJe+xiKALujPQ+1tkVYHz/Z/nieCwBa60ilVBiQF7j+5OBKqe5ErXVQvHjxpI7dZsWxFfQO6M3Vu1cZ4jOE4bWH4+ToZLflCSFESpBgYlBKbQIKxjLpY631yrhmi6VNx9Me3zxPN2o9B5gD4OXlFWufF3HlzhX6ru3Lsr+W4f6SO6varqJy4cpJvRghhEiREkwMWusGzzFuKFAs2vOiwCVre9FY2qPPE6qUygDkBG4+x7Kfm9aa7w9/z4B1A7gbcZex9cYy6NVBODo4JmcYQghhqBfd+RwXf6CN9UijkkTtZN6rtb4MhCulqln3H3QAVkabp6P1cStgi3U/RLI4H3ae1398nY4rOuKSz4XgHsEMqzlMkoIQIt15oX0MSqm3gK+B/ECAUipYa91Ya31UKbUU+AuIBPporc3W2XoBCwEnYK31BjAf+F4pdYqoNYU2LxJbYlm0hZn7ZjJk8xC01nzl+xW9vXtL0TshRLqlkvFHuV14eXnpoKCg55r3+PXjdF3VlR3nd9CwVEPmNJuDcy7npA1QCCFSIKXUfq21V2zT0u2Zz98e/JbeAb1xcnRiwRsL6OjeUcpZCCEE6TgxlMlbhqZlmvLN699QMFtsB10JIUT6lG4TQ43iNahRvIbRYQghRIpjr6OShBBCpFKSGIQQQsQgiUEIIUQMkhiEEELEIIlBCCFEDJIYhBBCxCCJQQghRAySGIQQQsSQ6mslKaWuAeeMjuM55COWixClcentNae31wvymlOTElrr/LFNSPWJIbVSSgXFVcAqrUpvrzm9vV6Q15xWyKYkIYQQMUhiEEIIEYMk
BuPMMToAA6S315zeXi/Ia04TZB+DEEKIGGSNQQghRAySGIQQQsQgicFgSqmBSimtlMpndCz2ppSaqJQ6ppQ6rJT6TSmVy+iY7EUp5auUOq6UOqWUGmJ0PPamlCqmlNqqlPpbKXVUKdXf6JiSg1LKQSl1UCm12uhYkpIkBgMppYoBDYHzRseSTDYCrlrrisAJYKjB8diFUsoBmA68BpQH2iqlyhsbld1FAh9qrcsB1YA+6eA1A/QH/jY6iKQmicFYU4CPgHRxBIDWeoPWOtL6dDdQ1Mh47KgKcEprfUZr/Qj4GXjD4JjsSmt9WWt9wPo4nKgvyyLGRmVfSqmiQBNgntGxJDVJDAZRSjUHLmqtDxkdi0G6AGuNDsJOigAXoj0PJY1/SUanlHIGKgF7jI3E7qYS9cPOYnQgSS2D0QGkZUqpTUDBWCZ9DAwDGiVvRPYX32vWWq+09vmYqE0Pi5MztmSkYmlLF2uFSqlswC/AAK31v0bHYy9KqabAVa31fqVUHaPjSWqSGOxIa90gtnallBtQEjiklIKoTSoHlFJVtNb/JGOISS6u1/yYUqoj0BSor9PuSTShQLFoz4sClwyKJdkopRyJSgqLtda/Gh2PnfkAzZVSrwOZgRxKqR+01u8ZHFeSkBPcUgClVAjgpbVOjRUaE00p5QtMBmprra8ZHY+9KKUyELVzvT5wEdgHtNNaHzU0MDtSUb9wFgE3tdYDjI4nOVnXGAZqrZsaHUtSkX0MIjl9A2QHNiqlgpVSs4wOyB6sO9jfB9YTtRN2aVpOClY+QHugnvV/G2z9NS1SIVljEEIIEYOsMQghhIhBEoMQQogYJDEIIYSIQRKDEEKIGCQxCCGEiEESgxBCiBgkMQghhIjh/39tZYMAmmdpAAAAAElFTkSuQmCC\n",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.scatter(x, y_noisy, label='empirical data points')\n",
- "plt.plot(x, y, color='black', label='true relationship')\n",
- "plt.plot(inputs, outputs, color='red', label='linear regression (cpu)')\n",
- "plt.plot(inputs, outputs_gpu.to_array(), color='green', label='ridge regression (gpu)')\n",
- "plt.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## K Nearest Neighbors\n",
- "\n",
-    "NearestNeighbors is an unsupervised algorithm: to find the “closest” datapoint(s) to a new, unseen point, it calculates a suitable “distance” from that point to every other point and returns the K datapoints with the smallest distances.\n",
- "\n",
- "We'll generate some fake data using the `make_moons` function from the `sklearn.datasets` module. This function generates data points from two equations, each describing a half circle with a unique center. Since each data point is generated by one of these two equations, the cluster each data point belongs to is clear. The ideal classification algorithm will identify two clusters and associate each data point with the equation that generated it. \n",
- "\n",
-    "These data points are generated using a non-linear relationship, so a linear regression approach won't adequately solve the problem. Instead, we can use a distance-based algorithm, K Nearest Neighbors, to classify each data point.\n",
- "\n",
-    "First, let's generate our data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(1000, 2)\n"
- ]
- }
- ],
- "source": [
- "from sklearn.datasets import make_moons\n",
- "\n",
- "\n",
- "X, y = make_moons(n_samples=int(1e3), noise=0.05, random_state=0)\n",
- "print(X.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's visualize our data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "figure = plt.figure()\n",
- "axis = figure.add_subplot(111)\n",
- "axis.scatter(X[y == 0, 0], X[y == 0, 1], \n",
- " edgecolor='black',\n",
- " c='lightblue', marker='o', s=40, label='cluster 1')\n",
- "\n",
- "axis.scatter(X[y == 1, 0], X[y == 1, 1], \n",
- " edgecolor='black',\n",
- " c='red', marker='s', s=40, label='cluster 2')\n",
- "plt.legend()\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before we build a KNN classification model, we first have to convert our data to a cuDF representation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [],
- "source": [
- "X_df = cudf.DataFrame()\n",
- "for column in range(X.shape[1]):\n",
- " X_df['feature_' + str(column)] = np.ascontiguousarray(X[:, column])\n",
- "\n",
- "y_df = cudf.Series(y)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we'll instantiate and fit a nearest neighbors model using the `NearestNeighbors` class from cuML."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [],
- "source": [
- "from cuml.neighbors import NearestNeighbors\n",
- "\n",
- "\n",
- "knn = NearestNeighbors()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "NearestNeighbors(n_neighbors=5, verbose=False, handle=, algorithm='brute', metric='euclidean')"
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "knn.fit(X_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Once our model has been built and fitted to the data, we can query the model for the `k` nearest neighbors to each data point. The query returns a matrix of distances from each data point to its `k` nearest neighbors, as well as the indices of those neighbors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [],
- "source": [
- "k = 3\n",
- "\n",
- "distances, indices = knn.kneighbors(X_df, n_neighbors=k)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can iterate through each of our data points and do a majority vote to determine which class it belongs to."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [],
- "source": [
- "predictions = []\n",
- "\n",
- "for i in range(indices.shape[0]):\n",
- " row = indices.iloc[i, :]\n",
- " vote = sum(y_df[j] for j in row) / k\n",
- " predictions.append(1.0 * (vote > 0.5))\n",
- "\n",
- "predictions = np.asarray(predictions).astype(np.float32)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Lastly, we can visualize the predictions from our K Nearest Neighbors classifier - we see that despite the non-linearity of the data, the algorithm does an excellent job of classifying the data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))\n",
- "\n",
- "\n",
- "ax1.scatter(X[y == 0, 0], X[y == 0, 1],\n",
- " edgecolor='black',\n",
- " c='lightblue', marker='o', s=40, label='cluster 1')\n",
- "ax1.scatter(X[y == 1, 0], X[y == 1, 1],\n",
- " edgecolor='black',\n",
- " c='red', marker='s', s=40, label='cluster 2')\n",
- "ax1.set_title('empirical data points')\n",
- "\n",
- "\n",
- "ax2.scatter(X[predictions == 0, 0], X[predictions == 0, 1], c='lightblue',\n",
- " edgecolor='black',\n",
- " marker='o', s=40, label='cluster 1')\n",
- "ax2.scatter(X[predictions == 1, 0], X[predictions == 1, 1], c='red',\n",
- " edgecolor='black',\n",
- " marker='s', s=40, label='cluster 2')\n",
- "ax2.set_title('KNN predicted classes')\n",
- "\n",
- "plt.legend()\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Conclusion\n",
- "\n",
-    "In this notebook, we showed how to do GPU-accelerated supervised learning in RAPIDS. \n",
- "\n",
- "To learn more about RAPIDS, be sure to check out: \n",
- "\n",
- "* [Open Source Website](http://rapids.ai)\n",
- "* [GitHub](https://github.com/rapidsai/)\n",
- "* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
- "* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
- "* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
- "* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/E2E/census/census_education2income_demo.ipynb b/intermediate_notebooks/E2E/census/census_education2income_demo.ipynb
deleted file mode 100644
index 4ee35863..00000000
--- a/intermediate_notebooks/E2E/census/census_education2income_demo.ipynb
+++ /dev/null
@@ -1,519 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Census Notebook\n",
- "**Authorship** \n",
- "Original Author: Taurean Dyer \n",
- "Last Edit: Taurean Dyer, 9/26/2019 \n",
- "\n",
- "**Test System Specs** \n",
- "Test System Hardware: GV100 \n",
- "Test System Software: Ubuntu 18.04 \n",
- "RAPIDS Version: 0.10.0a - Docker Install \n",
- "Driver: 410.79 \n",
- "CUDA: 10.0 \n",
- "\n",
- "\n",
- "**Known Working Systems** \n",
-    "RAPIDS Versions: 0.8, 0.9, 0.10\n",
- "\n",
- "# Intro\n",
-    "Held every 10 years, the US census gives a detailed snapshot in time of the makeup of the country. The last census in 2010 surveyed nearly 309 million people. IPUMS.org provides researchers an open source data set with 1% to 10% of the census data set. In this notebook, we want to see how education affects total income earned in the US, based on data from each census from 1970 to 2010, and see if we can predict some results if the census were held today, according to the national average. We will go through the ETL, training the model, and then testing the prediction. We'll make every effort to get as balanced a dataset as we can. We'll also pull some extra variables to allow for further self-exploration of gender-based education and income breakdowns. On a single Titan RTX, you can run the whole notebook workflow on the 4GB dataset of 14 million rows by 44 columns in less than 3 minutes. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Let's begin!**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import numpy as np\n",
- "import cuml\n",
- "import cudf\n",
- "import dask_cudf\n",
- "import sys\n",
- "import os\n",
- "from pprint import pprint\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Get your data!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "The IPUMS dataset is stored as a gzipped CSV in our S3 bucket. \n",
-    "1. We'll need to create a folder for our data in the `/data` folder\n",
-    "1. Download the compressed data into that folder from S3\n",
-    "1. Load the compressed data quickly into cuDF using its `read_csv()` parameters"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import urllib.request\n",
- "\n",
- "data_dir = '../../../data/census/'\n",
- "if not os.path.exists(data_dir):\n",
- " print('creating census data directory')\n",
-    "    os.makedirs(data_dir, exist_ok=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# download the IPUMS dataset\n",
- "base_url = 'https://rapidsai-data.s3.us-east-2.amazonaws.com/datasets/'\n",
- "fn = 'ipums_education2income_1970-2010.csv.gz'\n",
- "if not os.path.isfile(data_dir+fn):\n",
- " print(f'Downloading {base_url+fn} to {data_dir+fn}')\n",
- " urllib.request.urlretrieve(base_url+fn, data_dir+fn)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def load_data(cached = data_dir+fn):\n",
- " if os.path.exists(cached):\n",
- " print('use ipums data')\n",
- " X = cudf.read_csv(cached, compression='infer')\n",
- " else:\n",
-    "        print(\"No data found! Please check that your data directory is ../../../data/census/ and that you downloaded the data. If you did, please delete the `../../../data/census/` directory and run the two cells above again\")\n",
-    "        X = None\n",
- " return X"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df = load_data(data_dir+fn)\n",
- "print('data',df.shape)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(df.head(5).to_pandas())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df.dtypes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "original_counts = df.YEAR.value_counts()\n",
- "print(original_counts) ### Remember these numbers!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## ETL"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Cleaning Income data\n",
-    "First, let's focus on cleaning out the bad values for Total Income `INCTOT`. Let's start by checking whether there are any `N/A` values, since when we ran `head()`, we saw some in other columns, like `CBSERIAL`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df['INCTOT_NA'] = df['INCTOT'].isna()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(df.INCTOT_NA.value_counts())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Okay, great, there are no `N/A`s...or are there? Let's drop `INCTOT_NA` and see what our value counts look like"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df=df.drop('INCTOT_NA')\n",
-    "print(df.INCTOT.value_counts().to_pandas()) ### Wow, look how many people in America make $10,000,000! Wait a minute... "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Not that many people make $10M a year. According to https://usa.ipums.org/usa-action/variables/INCTOT#codes_section, `9999999` is INCTOT's code for `N/A`. That is why `isna` didn't find any when we ran it. Let's first create a new dataframe that contains only the NA values, then pull those encoded `N/A`s out of our working dataframe!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('data',df.shape)\n",
- "tdf = df.query('INCTOT == 9999999')\n",
- "df = df.query('INCTOT != 9999999')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('working data',df.shape)\n",
- "print('junk count data',tdf.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We're down by nearly 1/4 of our original dataset size. For the curious, now we should be able to get accurate Total Income data, by year, not taking into account inflation"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
-    "print(df.groupby('YEAR')['INCTOT'].mean()) # without that cleanup, the average would have been in the millions...."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Normalize Income for inflation\n",
-    "Now that we have reduced our dataframe to a clean baseline for answering our question, we should normalize the amounts for inflation. `CPI99` is the variable that IPUMS uses to hold the inflation factor. All we have to do is multiply each income by the factor for its year. Let's see how that changes the Total Income values from just above!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(df.groupby('YEAR')['CPI99'].mean()) ## it just returns the CPI99\n",
- "df['INCTOT'] = df['INCTOT'] * df['CPI99']\n",
- "print(df.groupby('YEAR')['INCTOT'].mean()) ## let's see what we got!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Cleaning Education Data\n",
-    "Okay, great! Now that we have income cleaned up, much of our next set of values of interest, namely Education and Education Detailed, should have been cleaned too. However, there are still some `N/A`s in key variables to worry about, which can cause problems later. Let's create a list of them..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "suspect = ['CBSERIAL','EDUC', 'EDUCD', 'EDUC_HEAD', 'EDUC_POP', 'EDUC_MOM','EDUCD_MOM2','EDUCD_POP2', 'INCTOT_MOM','INCTOT_POP','INCTOT_MOM2','INCTOT_POP2', 'INCTOT_HEAD']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "for i in range(0, len(suspect)):\n",
- " df[suspect[i]] = df[suspect[i]].fillna(-1)\n",
- " print(suspect[i], df[suspect[i]].value_counts())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Let's drop any rows with `-1`s in Education and Education Detailed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "totincome = ['EDUC','EDUCD']\n",
- "for i in range(0, len(totincome)):\n",
- " query = totincome[i] + ' != -1'\n",
- " df = df.query(query)\n",
- " print(totincome[i])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(df.shape)\n",
-    "df.head().to_pandas()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Well, the good news is that we lost no further rows. Next, let's start to normalize the data so that when we run our regression, one year doesn't unfairly dominate the data."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Normalize the Data\n",
-    "In this last step, we need to keep our data at about the same ratio as when we started (1% of the population), with the exception of 1980, which was a 5% sample and needs to be reduced. This is why we kept the temp dataframe `tdf`: to get the counts per year. Let's find out just how many rows have to be removed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('Working data: \\n', df.YEAR.value_counts())\n",
- "print('junk count data: \\n', tdf.YEAR.value_counts())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "And now, so that we can compute the MSE, let's make all the dtypes the same."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df.dtypes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "keep_cols = ['YEAR', 'DATANUM', 'SERIAL', 'CBSERIAL', 'HHWT', 'GQ', 'PERNUM', 'SEX', 'AGE', 'INCTOT', 'EDUC', 'EDUCD', 'EDUC_HEAD', 'EDUC_POP', 'EDUC_MOM','EDUCD_MOM2','EDUCD_POP2', 'INCTOT_MOM','INCTOT_POP','INCTOT_MOM2','INCTOT_POP2', 'INCTOT_HEAD', 'SEX_HEAD']\n",
- "df = df.loc[:, keep_cols]\n",
- "#df = df.drop(col for col in df.columns if col not in keep_cols)\n",
-    "for col in keep_cols:\n",
-    "    df[col] = df[col].fillna(-1)\n",
-    "    print(col, df[col].value_counts())\n",
-    "    df[col] = df[col].astype('float64')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
-    "## TODO: reduce the 1980 sample here. That needs .sample(), which is not working, unless there is a workaround..."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "With the important data now clean and normalized, let's start doing the regression"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Ridge Regression\n",
-    "We have 44 variables. The other variables may provide important predictive information. Ridge regression, with cross-validation to identify the best hyperparameters, may be the best way to get the most accurate model. We'll have to:\n",
- "\n",
- "* define our performance metrics\n",
- "* split our data into train and test sets\n",
- "* train and test our model\n",
- "\n",
- "Let's begin and see what we get!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# As our performance metrics we'll use a basic mean squared error and coefficient of determination implementation\n",
- "def mse(y_test, y_pred):\n",
- " return ((y_test.reset_index(drop=True) - y_pred.reset_index(drop=True)) ** 2).mean()\n",
- "\n",
- "def cod(y_test, y_pred):\n",
- " y_bar = y_test.mean()\n",
- " total = ((y_test - y_bar) ** 2).sum()\n",
- " residuals = ((y_test.reset_index(drop=True) - y_pred.reset_index(drop=True)) ** 2).sum()\n",
- " return 1 - (residuals / total)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
-    "from cuml.preprocessing.model_selection import train_test_split\n",
-    "from cuml.linear_model.ridge import Ridge\n",
-    "\n",
-    "yCol = \"EDUC\"\n",
-    "\n",
-    "# 90/10 train/test split by default\n",
-    "def train_and_score(data, clf, train_frac=0.9, n_runs=20):\n",
-    "    mse_scores, cod_scores = [], []\n",
-    "    for _ in range(n_runs):\n",
-    "        # split the dataframe passed in as `data`, holding out `yCol` as the target\n",
-    "        X_train, X_test, y_train, y_test = train_test_split(data, yCol, train_size=train_frac)\n",
-    "        y_pred = clf.fit(X_train, y_train).predict(X_test)\n",
-    "        mse_scores.append(mse(y_test, y_pred))\n",
-    "        cod_scores.append(cod(y_test, y_pred))\n",
-    "    return mse_scores, cod_scores"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Results\n",
-    "**Moment of truth! Let's see how our regression training does!**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "n_runs = 20\n",
- "clf = Ridge()\n",
- "mse_scores, cod_scores = train_and_score(df, clf, n_runs=n_runs)\n",
- "print(f\"median MSE ({n_runs} runs): {np.median(mse_scores)}\")\n",
- "print(f\"median COD ({n_runs} runs): {np.median(cod_scores)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "**Fun fact:** if you made `INCTOT` the target variable (`yCol`), your prediction results would not be so pretty! It suggests that these variables are a much better indicator of your education level than they are of your income. You have better odds flipping a coin!\n",
- "\n",
- "* median MSE (50 runs): 518189521.07548225\n",
- "* median COD (50 runs): 0.425769113846303"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Next Steps/Self Study\n",
- "* You can pickle the model and use it in another workflow\n",
-    "* You can redo the workflow based on head of household, using the `X_HEAD` columns for X in `EDUC`, `SEX`, and `INCTOT`\n",
-    "* You can explore the growing role of education for women, and their changing role in the workforce and income, with `EDUC_MOM` and `EDUC_POP`"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/E2E/mortgage/mortgage_e2e.ipynb b/intermediate_notebooks/E2E/mortgage/mortgage_e2e.ipynb
deleted file mode 100644
index 8fb2de06..00000000
--- a/intermediate_notebooks/E2E/mortgage/mortgage_e2e.ipynb
+++ /dev/null
@@ -1,893 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Mortgage Workflow\n",
- "\n",
- "## The Dataset\n",
- "The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.\n",
- "\n",
-    "To acquire this dataset, please visit the [RAPIDS Datasets Homepage](https://docs.rapids.ai/datasets/mortgage-data).\n",
- "\n",
- "## Introduction\n",
- "The Mortgage workflow is composed of three core phases:\n",
- "\n",
- "1. ETL - Extract, Transform, Load\n",
- "2. Data Conversion\n",
- "3. ML - Training\n",
- "\n",
- "### ETL\n",
- "Data is \n",
- "1. Read in from storage\n",
- "2. Transformed to emphasize key features\n",
- "3. Loaded into volatile memory for conversion\n",
- "\n",
- "### Data Conversion\n",
- "Features are\n",
- "1. Broken into (labels, data) pairs\n",
- "2. Distributed across many workers\n",
- "3. Converted into compressed sparse row (CSR) matrix format for XGBoost\n",
- "\n",
- "### Machine Learning\n",
- "The CSR data is fed into a distributed training session with `xgboost.dask`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "---\n",
- "If required, the notebook can be converted to a python script for execution using tools like `nbconvert`\n",
- "\n",
- "```sh\n",
- "$ jupyter nbconvert --to python mortgage_e2e.ipynb\n",
- "$ python mortgage_e2e.py\n",
- "```\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### Import statements"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "from utils.utils import (\n",
- " determine_dataset,\n",
- " get_data,\n",
- " memory_info,\n",
- ")\n",
- "\n",
- "from dask_cuda import LocalCUDACluster\n",
- "from dask.delayed import delayed\n",
- "from dask.distributed import Client, wait\n",
- "import rmm\n",
- "\n",
- "import numpy as np\n",
- "\n",
- "from collections import OrderedDict\n",
- "import argparse\n",
- "import gc\n",
- "from glob import glob\n",
- "import os\n",
- "import subprocess\n",
- "import time"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Define functions to encapsulate the workflow into a single call"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "def run_dask_task(func, **kwargs):\n",
- " task = func(**kwargs)\n",
- " return task\n",
- "\n",
- "\n",
- "def process_quarter_gpu(\n",
- " year=2000, quarter=1, perf_file=\"\", data_dir=\"\", client=None, **kwargs\n",
- "):\n",
- " ml_arrays = run_dask_task(\n",
- " delayed(run_gpu_workflow), quarter=quarter, year=year, perf_file=perf_file\n",
- " )\n",
- " return client.compute(ml_arrays, optimize_graph=False, fifo_timeout=\"0ms\")\n",
- "\n",
- "\n",
- "def run_gpu_workflow(\n",
- " quarter=1, year=2000, perf_file=\"\", acq_file=\"\", names_file=\"\", **kwargs\n",
- "):\n",
-    "    # data_dir is not a parameter here; it is read from the notebook's global scope\n",
-    "    names = gpu_load_names(col_names_path=data_dir + \"names.csv\")\n",
- " names = hash_df_string_columns(names)\n",
- " acq_gdf = gpu_load_acquisition_csv(\n",
- " acquisition_path=data_dir\n",
- " + \"acq\"\n",
- " + \"/Acquisition_\"\n",
- " + str(year)\n",
- " + \"Q\"\n",
- " + str(quarter)\n",
- " + \".txt\"\n",
- " )\n",
- " acq_gdf = hash_df_string_columns(acq_gdf)\n",
- " acq_gdf = acq_gdf.merge(names, how=\"left\", on=[\"seller_name\"])\n",
- " acq_gdf.drop_column(\"seller_name\")\n",
- " acq_gdf[\"seller_name\"] = acq_gdf[\"new\"]\n",
- " acq_gdf.drop_column(\"new\")\n",
- " perf_df_tmp = gpu_load_performance_csv(perf_file)\n",
- " perf_df_tmp = hash_df_string_columns(perf_df_tmp)\n",
- " gdf = perf_df_tmp\n",
- " everdf = create_ever_features(gdf)\n",
- " delinq_merge = create_delinq_features(gdf)\n",
- " everdf = join_ever_delinq_features(everdf, delinq_merge)\n",
- " del delinq_merge\n",
- " joined_df = create_joined_df(gdf, everdf)\n",
- " testdf = create_12_mon_features(joined_df)\n",
- " joined_df = combine_joined_12_mon(joined_df, testdf)\n",
- " del testdf\n",
- " perf_df = final_performance_delinquency(gdf, joined_df)\n",
- " del (gdf, joined_df)\n",
- " final_gdf = join_perf_acq_gdfs(perf_df, acq_gdf)\n",
- " del perf_df\n",
- " del acq_gdf\n",
- " final_gdf = last_mile_cleaning(final_gdf)\n",
- " return final_gdf\n",
- "\n",
- "\n",
- "def gpu_load_performance_csv(performance_path, **kwargs):\n",
- " \"\"\" \n",
- " Loads performance data\n",
- "\n",
- " Returns\n",
- " -------\n",
- " GPU DataFrame\n",
- " \"\"\"\n",
- "\n",
- " cols = [\n",
- " \"loan_id\",\n",
- " \"monthly_reporting_period\",\n",
- " \"servicer\",\n",
- " \"interest_rate\",\n",
- " \"current_actual_upb\",\n",
- " \"loan_age\",\n",
- " \"remaining_months_to_legal_maturity\",\n",
- " \"adj_remaining_months_to_maturity\",\n",
- " \"maturity_date\",\n",
- " \"msa\",\n",
- " \"current_loan_delinquency_status\",\n",
- " \"mod_flag\",\n",
- " \"zero_balance_code\",\n",
- " \"zero_balance_effective_date\",\n",
- " \"last_paid_installment_date\",\n",
- " \"foreclosed_after\",\n",
- " \"disposition_date\",\n",
- " \"foreclosure_costs\",\n",
- " \"prop_preservation_and_repair_costs\",\n",
- " \"asset_recovery_costs\",\n",
- " \"misc_holding_expenses\",\n",
- " \"holding_taxes\",\n",
- " \"net_sale_proceeds\",\n",
- " \"credit_enhancement_proceeds\",\n",
- " \"repurchase_make_whole_proceeds\",\n",
- " \"other_foreclosure_proceeds\",\n",
- " \"non_interest_bearing_upb\",\n",
- " \"principal_forgiveness_upb\",\n",
- " \"repurchase_make_whole_proceeds_flag\",\n",
- " \"foreclosure_principal_write_off_amount\",\n",
- " \"servicing_activity_indicator\",\n",
- " ]\n",
- "\n",
- " dtypes = OrderedDict(\n",
- " [\n",
- " (\"loan_id\", \"int64\"),\n",
- " (\"monthly_reporting_period\", \"date\"),\n",
- " (\"servicer\", \"str\"),\n",
- " (\"interest_rate\", \"float64\"),\n",
- " (\"current_actual_upb\", \"float64\"),\n",
- " (\"loan_age\", \"float64\"),\n",
- " (\"remaining_months_to_legal_maturity\", \"float64\"),\n",
- " (\"adj_remaining_months_to_maturity\", \"float64\"),\n",
- " (\"maturity_date\", \"date\"),\n",
- " (\"msa\", \"float64\"),\n",
- " (\"current_loan_delinquency_status\", \"int32\"),\n",
- " (\"mod_flag\", \"str\"),\n",
- " (\"zero_balance_code\", \"str\"),\n",
- " (\"zero_balance_effective_date\", \"date\"),\n",
- " (\"last_paid_installment_date\", \"date\"),\n",
- " (\"foreclosed_after\", \"date\"),\n",
- " (\"disposition_date\", \"date\"),\n",
- " (\"foreclosure_costs\", \"float64\"),\n",
- " (\"prop_preservation_and_repair_costs\", \"float64\"),\n",
- " (\"asset_recovery_costs\", \"float64\"),\n",
- " (\"misc_holding_expenses\", \"float64\"),\n",
- " (\"holding_taxes\", \"float64\"),\n",
- " (\"net_sale_proceeds\", \"float64\"),\n",
- " (\"credit_enhancement_proceeds\", \"float64\"),\n",
- " (\"repurchase_make_whole_proceeds\", \"float64\"),\n",
- " (\"other_foreclosure_proceeds\", \"float64\"),\n",
- " (\"non_interest_bearing_upb\", \"float64\"),\n",
- " (\"principal_forgiveness_upb\", \"float64\"),\n",
- " (\"repurchase_make_whole_proceeds_flag\", \"str\"),\n",
- " (\"foreclosure_principal_write_off_amount\", \"float64\"),\n",
- " (\"servicing_activity_indicator\", \"str\"),\n",
- " ]\n",
- " )\n",
- "\n",
- " return cudf.read_csv(\n",
- " performance_path,\n",
- " names=cols,\n",
- " delimiter=\"|\",\n",
- " dtype=list(dtypes.values()),\n",
- " skiprows=1,\n",
- " )\n",
- "\n",
- "\n",
- "def gpu_load_acquisition_csv(acquisition_path, **kwargs):\n",
- " \"\"\" \n",
- " Loads acquisition data\n",
- "\n",
- " Returns\n",
- " -------\n",
- " GPU DataFrame\n",
- " \"\"\"\n",
- "\n",
- " cols = [\n",
- " \"loan_id\",\n",
- " \"orig_channel\",\n",
- " \"seller_name\",\n",
- " \"orig_interest_rate\",\n",
- " \"orig_upb\",\n",
- " \"orig_loan_term\",\n",
- " \"orig_date\",\n",
- " \"first_pay_date\",\n",
- " \"orig_ltv\",\n",
- " \"orig_cltv\",\n",
- " \"num_borrowers\",\n",
- " \"dti\",\n",
- " \"borrower_credit_score\",\n",
- " \"first_home_buyer\",\n",
- " \"loan_purpose\",\n",
- " \"property_type\",\n",
- " \"num_units\",\n",
- " \"occupancy_status\",\n",
- " \"property_state\",\n",
- " \"zip\",\n",
- " \"mortgage_insurance_percent\",\n",
- " \"product_type\",\n",
- " \"coborrow_credit_score\",\n",
- " \"mortgage_insurance_type\",\n",
- " \"relocation_mortgage_indicator\",\n",
- " ]\n",
- "\n",
- " dtypes = OrderedDict(\n",
- " [\n",
- " (\"loan_id\", \"int64\"),\n",
- " (\"orig_channel\", \"str\"),\n",
- " (\"seller_name\", \"str\"),\n",
- " (\"orig_interest_rate\", \"float64\"),\n",
- " (\"orig_upb\", \"int64\"),\n",
- " (\"orig_loan_term\", \"int64\"),\n",
- " (\"orig_date\", \"date\"),\n",
- " (\"first_pay_date\", \"date\"),\n",
- " (\"orig_ltv\", \"float64\"),\n",
- " (\"orig_cltv\", \"float64\"),\n",
- " (\"num_borrowers\", \"float64\"),\n",
- " (\"dti\", \"float64\"),\n",
- " (\"borrower_credit_score\", \"float64\"),\n",
- " (\"first_home_buyer\", \"str\"),\n",
- " (\"loan_purpose\", \"str\"),\n",
- " (\"property_type\", \"str\"),\n",
- " (\"num_units\", \"int64\"),\n",
- " (\"occupancy_status\", \"str\"),\n",
- " (\"property_state\", \"str\"),\n",
- " (\"zip\", \"int64\"),\n",
- " (\"mortgage_insurance_percent\", \"float64\"),\n",
- " (\"product_type\", \"str\"),\n",
- " (\"coborrow_credit_score\", \"float64\"),\n",
- " (\"mortgage_insurance_type\", \"float64\"),\n",
- " (\"relocation_mortgage_indicator\", \"str\"),\n",
- " ]\n",
- " )\n",
- "\n",
- " return cudf.read_csv(\n",
- " acquisition_path,\n",
- " names=cols,\n",
- " delimiter=\"|\",\n",
- " dtype=list(dtypes.values()),\n",
- " skiprows=1,\n",
- " )\n",
- "\n",
- "\n",
- "def gpu_load_names(col_names_path=\"\", **kwargs):\n",
- " \"\"\" \n",
- " Loads names used for renaming the banks\n",
- "\n",
- " Returns\n",
- " -------\n",
- " GPU DataFrame\n",
- " \"\"\"\n",
- "\n",
- " cols = [\"seller_name\", \"new\"]\n",
- "\n",
- " dtypes = OrderedDict([(\"seller_name\", \"str\"), (\"new\", \"str\"),])\n",
- "\n",
- " return cudf.read_csv(\n",
- " col_names_path,\n",
- " names=cols,\n",
- " delimiter=\"|\",\n",
- " dtype=list(dtypes.values()),\n",
- " skiprows=1,\n",
- " )\n",
- "\n",
- "\n",
- "def hash_df_string_columns(gdf):\n",
- " \"\"\"\n",
- " Hash all string columns in a cudf dataframe\n",
- "\n",
- " Returns\n",
- " -------\n",
- " Dataframe with all string columns replaced by hashed values for the strings\n",
- " \"\"\"\n",
- " for col in gdf.columns:\n",
- " if cudf.utils.dtypes.is_string_dtype(gdf[col]):\n",
- " gdf[col] = gdf[col].hash_values()\n",
- " return gdf\n",
- "\n",
- "\n",
- "def create_ever_features(gdf, **kwargs):\n",
- " \"\"\"\n",
- " Creates features denoting whether a loan_id has ever been delinquent\n",
- " for over 30, 90 and 180 days.\n",
- " \"\"\"\n",
- " everdf = gdf[[\"loan_id\", \"current_loan_delinquency_status\"]]\n",
- " everdf = everdf.groupby(\"loan_id\", method=\"hash\", as_index=False).max()\n",
- " del gdf\n",
- " everdf[\"ever_30\"] = (everdf[\"current_loan_delinquency_status\"] >= 1).astype(\"int8\")\n",
- " everdf[\"ever_90\"] = (everdf[\"current_loan_delinquency_status\"] >= 3).astype(\"int8\")\n",
- " everdf[\"ever_180\"] = (everdf[\"current_loan_delinquency_status\"] >= 6).astype(\"int8\")\n",
- " everdf.drop_column(\"current_loan_delinquency_status\")\n",
- " return everdf\n",
- "\n",
- "\n",
- "def create_delinq_features(gdf, **kwargs):\n",
- " \"\"\"\n",
- " Computes features denoting the earliest reported date when a loan_id\n",
- " became delinquent for more than 30, 90 and 180 days.\n",
- " \"\"\"\n",
- " delinq_gdf = gdf[\n",
- " [\"loan_id\", \"monthly_reporting_period\", \"current_loan_delinquency_status\",]\n",
- " ]\n",
- " del gdf\n",
- " delinq_30 = (\n",
- " delinq_gdf.query(\"current_loan_delinquency_status >= 1\")[\n",
- " [\"loan_id\", \"monthly_reporting_period\"]\n",
- " ]\n",
- " .groupby(\"loan_id\", method=\"hash\", as_index=False)\n",
- " .min()\n",
- " )\n",
- " delinq_30[\"delinquency_30\"] = delinq_30[\"monthly_reporting_period\"]\n",
- " delinq_30.drop_column(\"monthly_reporting_period\")\n",
- " delinq_90 = (\n",
- " delinq_gdf.query(\"current_loan_delinquency_status >= 3\")[\n",
- " [\"loan_id\", \"monthly_reporting_period\"]\n",
- " ]\n",
- " .groupby(\"loan_id\", method=\"hash\", as_index=False)\n",
- " .min()\n",
- " )\n",
- " delinq_90[\"delinquency_90\"] = delinq_90[\"monthly_reporting_period\"]\n",
- " delinq_90.drop_column(\"monthly_reporting_period\")\n",
- " delinq_180 = (\n",
- " delinq_gdf.query(\"current_loan_delinquency_status >= 6\")[\n",
- " [\"loan_id\", \"monthly_reporting_period\"]\n",
- " ]\n",
- " .groupby(\"loan_id\", method=\"hash\", as_index=False)\n",
- " .min()\n",
- " )\n",
- " delinq_180[\"delinquency_180\"] = delinq_180[\"monthly_reporting_period\"]\n",
- " delinq_180.drop_column(\"monthly_reporting_period\")\n",
- " del delinq_gdf\n",
- " delinq_merge = delinq_30.merge(delinq_90, how=\"left\", on=[\"loan_id\"], type=\"hash\")\n",
- " delinq_merge = delinq_merge.merge(\n",
- " delinq_180, how=\"left\", on=[\"loan_id\"], type=\"hash\"\n",
- " )\n",
- " del delinq_30\n",
- " del delinq_90\n",
- " del delinq_180\n",
- " return delinq_merge\n",
- "\n",
- "\n",
- "def join_ever_delinq_features(everdf_tmp, delinq_merge, **kwargs):\n",
- " \"\"\"\n",
- " Merges the ever and delinq features table on loan_id\n",
- " \"\"\"\n",
- " everdf = everdf_tmp.merge(delinq_merge, on=[\"loan_id\"], how=\"left\", type=\"hash\")\n",
- " del everdf_tmp\n",
- " del delinq_merge\n",
- " return everdf\n",
- "\n",
- "\n",
- "def create_joined_df(gdf, everdf, **kwargs):\n",
- " \"\"\"\n",
- " Join the performance table with the features table. (delinq and ever features)\n",
- " \"\"\"\n",
- " test = gdf[\n",
- " [\n",
- " \"loan_id\",\n",
- " \"monthly_reporting_period\",\n",
- " \"current_loan_delinquency_status\",\n",
- " \"current_actual_upb\",\n",
- " ]\n",
- " ]\n",
- " del gdf\n",
- " test[\"timestamp\"] = test[\"monthly_reporting_period\"]\n",
- " test.drop_column(\"monthly_reporting_period\")\n",
- " test[\"timestamp_month\"] = test[\"timestamp\"].dt.month\n",
- " test[\"timestamp_year\"] = test[\"timestamp\"].dt.year\n",
- " test[\"delinquency_12\"] = test[\"current_loan_delinquency_status\"]\n",
- " test.drop_column(\"current_loan_delinquency_status\")\n",
- " test[\"upb_12\"] = test[\"current_actual_upb\"]\n",
- " test.drop_column(\"current_actual_upb\")\n",
- "\n",
- " joined_df = test.merge(everdf, how=\"left\", on=[\"loan_id\"], type=\"hash\")\n",
- " del everdf\n",
- " del test\n",
- "\n",
- " joined_df[\"timestamp_year\"] = joined_df[\"timestamp_year\"].astype(\"int32\")\n",
- " joined_df[\"timestamp_month\"] = joined_df[\"timestamp_month\"].astype(\"int32\")\n",
- "\n",
- " return joined_df\n",
- "\n",
- "\n",
- "def create_12_mon_features(joined_df, **kwargs):\n",
- " \"\"\"\n",
-    "    For every loan_id in a 12 month window, compute a feature denoting\n",
-    "    whether it has been delinquent for over 3 months or its unpaid principal balance reached zero.\n",
-    "    The 12 month window moves by a month to span across all months of the year.\n",
-    "    \n",
-    "    The computation windows for each loan_id follow the pattern below\n",
-    "    Window 1: Jan 2000 - Jan 2001, Jan 2001 - Jan 2002\n",
-    "    Window 2: Feb 2000 - Feb 2001, Feb 2001 - Feb 2002\n",
- " \"\"\"\n",
- " testdfs = []\n",
- " n_months = 12\n",
- " for y in range(1, n_months + 1):\n",
- " tmpdf = joined_df[\n",
- " [\"loan_id\", \"timestamp_year\", \"timestamp_month\", \"delinquency_12\", \"upb_12\"]\n",
- " ]\n",
- " tmpdf[\"josh_months\"] = tmpdf[\"timestamp_year\"] * 12 + tmpdf[\"timestamp_month\"]\n",
- " tmpdf[\"josh_mody_n\"] = (\n",
- " (tmpdf[\"josh_months\"].astype(\"float64\") - 24000 - y) / 12\n",
- " ).floor()\n",
- " tmpdf = tmpdf.groupby(\n",
- " [\"loan_id\", \"josh_mody_n\"], method=\"hash\", as_index=False\n",
- " ).agg({\"delinquency_12\": \"max\", \"upb_12\": \"min\"})\n",
- " tmpdf[\"delinquency_12\"] = (tmpdf[\"delinquency_12\"] > 3).astype(\"int32\")\n",
- " tmpdf[\"delinquency_12\"] += (tmpdf[\"upb_12\"] == 0).astype(\"int32\")\n",
- " tmpdf[\"timestamp_year\"] = (\n",
- " (((tmpdf[\"josh_mody_n\"] * n_months) + 24000 + (y - 1)) / 12)\n",
- " .floor()\n",
- " .astype(\"int16\")\n",
- " )\n",
- " tmpdf[\"timestamp_month\"] = np.int8(y)\n",
- " tmpdf.drop_column(\"josh_mody_n\")\n",
- " testdfs.append(tmpdf)\n",
- " del tmpdf\n",
- " del joined_df\n",
- "\n",
- " return cudf.concat(testdfs)\n",
- "\n",
- "\n",
- "def combine_joined_12_mon(joined_df, testdf, **kwargs):\n",
- " \"\"\"\n",
- " Combines the 12_mon features table with the ever_delinq features tables\n",
- " \"\"\"\n",
- " joined_df.drop_column(\"delinquency_12\")\n",
- " joined_df.drop_column(\"upb_12\")\n",
- " joined_df[\"timestamp_year\"] = joined_df[\"timestamp_year\"].astype(\"int16\")\n",
- " joined_df[\"timestamp_month\"] = joined_df[\"timestamp_month\"].astype(\"int8\")\n",
- " return joined_df.merge(\n",
- " testdf,\n",
- " how=\"left\",\n",
- " on=[\"loan_id\", \"timestamp_year\", \"timestamp_month\"],\n",
- " type=\"hash\",\n",
- " )\n",
- "\n",
- "\n",
- "def final_performance_delinquency(gdf, joined_df, **kwargs):\n",
- " \"\"\"\n",
- " Combines the grouped table with all features with the original Performance table\n",
- " \"\"\"\n",
- " merged = gdf\n",
- " joined_df[\"timestamp_month\"] = joined_df[\"timestamp_month\"].astype(\"int8\")\n",
- " joined_df[\"timestamp_year\"] = joined_df[\"timestamp_year\"].astype(\"int16\")\n",
- " merged[\"timestamp_month\"] = merged[\"monthly_reporting_period\"].dt.month\n",
- " merged[\"timestamp_month\"] = merged[\"timestamp_month\"].astype(\"int8\")\n",
- " merged[\"timestamp_year\"] = merged[\"monthly_reporting_period\"].dt.year\n",
- " merged[\"timestamp_year\"] = merged[\"timestamp_year\"].astype(\"int16\")\n",
- " merged = merged.merge(\n",
- " joined_df,\n",
- " how=\"left\",\n",
- " on=[\"loan_id\", \"timestamp_year\", \"timestamp_month\"],\n",
- " type=\"hash\",\n",
- " )\n",
- " merged.drop_column(\"timestamp_year\")\n",
- " merged.drop_column(\"timestamp_month\")\n",
- " return merged\n",
- "\n",
- "\n",
- "def join_perf_acq_gdfs(perf, acq, **kwargs):\n",
- " \"\"\"\n",
- " Combines the Acquisition and Performance tables on loan_id\n",
- " \"\"\"\n",
- " return perf.merge(acq, how=\"left\", on=[\"loan_id\"], type=\"hash\")\n",
- "\n",
- "\n",
- "def last_mile_cleaning(df, **kwargs):\n",
- " \"\"\"\n",
- " Final cleanup to drop columns not passed to the XGBoost model for training.\n",
- " Convert all string/categorical features to numeric features.\n",
- "\n",
- " Returns\n",
-    "    -------\n",
- " Arrow Table (Host memory)\n",
- " \"\"\"\n",
- " drop_list = [\n",
- " \"loan_id\",\n",
- " \"orig_date\",\n",
- " \"first_pay_date\",\n",
- " \"seller_name\",\n",
- " \"monthly_reporting_period\",\n",
- " \"last_paid_installment_date\",\n",
- " \"maturity_date\",\n",
- " \"ever_30\",\n",
- " \"ever_90\",\n",
- " \"ever_180\",\n",
- " \"delinquency_30\",\n",
- " \"delinquency_90\",\n",
- " \"delinquency_180\",\n",
- " \"upb_12\",\n",
- " \"zero_balance_effective_date\",\n",
- " \"foreclosed_after\",\n",
- " \"disposition_date\",\n",
- " \"timestamp\",\n",
- " ]\n",
- " for column in drop_list:\n",
- " df.drop_column(column)\n",
-    "    for col, dtype in df.dtypes.items():\n",
- " if str(dtype) == \"category\":\n",
- " df[col] = df[col].cat.codes\n",
- " df[col] = df[col].astype(\"float32\")\n",
- " df[\"delinquency_12\"] = df[\"delinquency_12\"] > 0\n",
- " df[\"delinquency_12\"] = df[\"delinquency_12\"].fillna(False).astype(\"int32\")\n",
- " for column in df.columns:\n",
- " df[column] = df[column].fillna(np.dtype(str(df[column].dtype)).type(-1))\n",
- " return df.to_arrow(preserve_index=False)\n",
- "\n",
- "\n",
- "def prepare_data(arrow_input):\n",
- " \"\"\"\n",
- " Convert a list of arrow tables to a single GPU dataframe\n",
- " \n",
- " Returns\n",
- " -------\n",
- " GPU Dataframe\n",
- " \"\"\"\n",
- " gpu_dataframes = []\n",
- " for arrow_df in arrow_input:\n",
- " gpu_dataframes.append(cudf.DataFrame.from_arrow(arrow_df))\n",
- "\n",
- " concat_df = cudf.concat(gpu_dataframes)\n",
- " del gpu_dataframes\n",
- " return concat_df\n",
- "\n",
- "\n",
- "def xgb_training(arrow_dfs, client=None):\n",
- " \"\"\"\n",
-    "    Convert the post ETL data to DMatrix format for XGBoost training input.\n",
-    "    Train the XGBoost model.\n",
-    "    \n",
-    "    Returns\n",
-    "    -------\n",
-    "    The trained model, the time taken to prepare the data, and the time taken to train.\n",
- " \"\"\"\n",
- " dxgb_gpu_params = {\n",
- " \"max_depth\": 8,\n",
- " \"max_leaves\": 2 ** 8,\n",
- " \"alpha\": 0.9,\n",
- " \"eta\": 0.1,\n",
- " \"gamma\": 0.1,\n",
- " \"learning_rate\": 0.1,\n",
- " \"subsample\": 1,\n",
- " \"reg_lambda\": 1,\n",
- " \"scale_pos_weight\": 2,\n",
- " \"min_child_weight\": 30,\n",
- " \"tree_method\": \"gpu_hist\",\n",
- " \"objective\": \"binary:logistic\",\n",
- " \"grow_policy\": \"lossguide\",\n",
- " }\n",
- " NUM_BOOST_ROUND = 100\n",
- "\n",
- " part_count = len(arrow_dfs)\n",
- " print(f\"Preparing data for training with part count: {part_count}\")\n",
- " t1 = time.time()\n",
- " tmp_map = [\n",
- " (arrow_df, list(client.who_has(arrow_df).values())[0][0])\n",
- " for arrow_df in arrow_dfs\n",
- " ]\n",
- " new_map = OrderedDict()\n",
- " for key, value in tmp_map:\n",
- " if value not in new_map:\n",
- " new_map[value] = [key]\n",
- " else:\n",
- " new_map[value].append(key)\n",
- "\n",
- " del (tmp_map, key, value)\n",
- "\n",
- " train_x_y = []\n",
- " for list_delayed in new_map.values():\n",
- " train_x_y.append(delayed(prepare_data)(list_delayed))\n",
- "\n",
- " del (new_map, list_delayed)\n",
- "\n",
- " worker_list = OrderedDict()\n",
- " for task in train_x_y:\n",
- " worker_list[task] = list(client.who_has(task).values())[0][0]\n",
- "\n",
- " del task\n",
- "\n",
- " persisted_train_x_y = []\n",
- " for task in train_x_y:\n",
- " persisted_train_x_y.append(\n",
- " client.persist(\n",
- " collections=task,\n",
- " workers=worker_list[task],\n",
- " optimize_graph=False,\n",
- " fifo_timeout=\"0ms\",\n",
- " )\n",
- " )\n",
- "\n",
- " del (arrow_dfs, train_x_y, worker_list, task)\n",
- "\n",
- " wait(persisted_train_x_y)\n",
- " persisted_train_x_y = dask_cudf.from_delayed(persisted_train_x_y)\n",
- "\n",
- " dmat = xgb.dask.DaskDMatrix(\n",
- " client=client,\n",
- " data=persisted_train_x_y[\n",
- " persisted_train_x_y.columns.difference([\"delinquency_12\"])\n",
- " ],\n",
- " label=persisted_train_x_y[[\"delinquency_12\"]],\n",
- " missing=-1,\n",
- " )\n",
- "\n",
- " del persisted_train_x_y\n",
- " gc.collect()\n",
- "\n",
- " dmat_time = time.time() - t1\n",
- " print(\"Prepared data for XGB training\")\n",
- "\n",
- " print(\"Training model\")\n",
- " t1 = time.time()\n",
- "\n",
- " print(\"XGB training for part_count:{}\".format(part_count))\n",
- " bst = xgb.dask.train(\n",
- " client, dxgb_gpu_params, dmat, num_boost_round=NUM_BOOST_ROUND,\n",
- " )\n",
- "\n",
- " train_time = time.time() - t1\n",
- " print(\"Training complete\")\n",
- " return (bst, dmat_time, train_time)\n",
- "\n",
- "\n",
- "def run_etl(start_year, end_year, data_dir, client):\n",
- " \"\"\"\n",
- " Driver function for the ETL step\n",
- " \n",
- " Iterates through all files in `data_dir` between `start_year` \n",
- " and `end_year` and calls the ETL function for each file.\n",
- " \n",
- " Returns\n",
- " -------\n",
- " Dask futures to arrow tables containing post ETL data for all processed files.\n",
- " \"\"\"\n",
- " print(\"Starting ETL\")\n",
- " t1 = time.time()\n",
- "\n",
- " perf_data_path = data_dir + \"perf/\"\n",
- "\n",
- " gpu_dfs = []\n",
- " quarter = 1\n",
- " year = start_year\n",
- " count = 0\n",
- " while year <= end_year:\n",
- " for file in glob(\n",
- " os.path.join(\n",
-    "                perf_data_path, \"Performance_\" + str(year) + \"Q\" + str(quarter) + \"*\"\n",
- " )\n",
- " ):\n",
- " gpu_dfs.append(\n",
- " process_quarter_gpu(\n",
- " year=year, quarter=quarter, perf_file=file, client=client\n",
- " )\n",
- " )\n",
- " count += 1\n",
- " quarter += 1\n",
- " if quarter == 5:\n",
- " year += 1\n",
- " quarter = 1\n",
- " print(\"ETL for start_year:{} and end_year:{}\".format(start_year, end_year))\n",
- " wait(gpu_dfs)\n",
- "\n",
- " etl_time = time.time() - t1\n",
- "\n",
- " print(\"ETL done!\")\n",
- " return (gpu_dfs, etl_time)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### The cell below runs the workflow end to end including the ETL and XGBoost model training step\n",
- "\n",
- "**Notes** \n",
- "\n",
- "The mortgage dataset for years 2000-2016 is about 200GB of data. There are two key factors that determine the `start_year`, `end_year`, `part_count` and `use_1GB_splits` params used in the notebook for processing this data. \n",
- "\n",
- "_Total GPU memory_: Determines the amount of data that can be trained using XGBoost (`part_count`). The ETL is performed on one part file at a time (per GPU) whereas XGBoost training requires all the training data to be loaded in GPU memory.\n",
- "\n",
- "_Memory per GPU_: Determines the variation of the dataset to use (1GB vs 2GB splits). The 2GB splits version of the data results in larger partitions being processed per task resulting in better utilization of the GPU, with the tradeoff of increased memory usage that can be handled by GPUs cards with greater than `32GB` of memory.\n",
- "\n",
- "The `determine_dataset` utility used below automatically queries these two parameters based on the machine and decides suitable values for `part_count` and consequently `start_year`, `end_year`(to ensure ETL is performed on enough parts for training), as well as the variation of the dataset (1GB split part files vs 2GB split part files) that should work on such systems.\n",
- "\n",
-    "To use data that you have already downloaded to your own location, or to adjust these parameters manually based on the amount of data you want to process, assign new values to the variables in the notebook or set the `MORTGAGE_DATA_DIR` and `part_count` environment variables. See the [RAPIDS Datasets Homepage](https://docs.rapids.ai/datasets/mortgage-data) for more information on downloading the data manually."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Downloading data for year 2000\n",
- "Download complete\n",
- "Decompressing and extracting data\n",
- "Done extracting year 2000\n",
- "Downloading data for year 2001\n",
- "Download complete\n",
- "Decompressing and extracting data\n",
- "Done extracting year 2001\n",
- "Downloading data for year 2002\n",
- "Download complete\n",
- "Decompressing and extracting data\n",
- "Done extracting year 2002\n",
- "Downloading data for year 2003\n",
- "Download complete\n",
- "Decompressing and extracting data\n",
- "Done extracting year 2003\n",
- "Downloading data for year 2004\n",
- "Download complete\n",
- "Decompressing and extracting data\n",
- "Done extracting year 2004\n",
- "Starting ETL\n",
- "ETL for start_year:2000 and end_year:2004\n",
- "ETL done!\n",
- "Preparing data for training with part count: 12\n",
- "Prepared data for XGB training\n",
- "Training model\n",
- "XGB training for part_count:12\n",
- "Training complete\n",
- "\n",
- "Time taken to run ETL from 2000 to 2004 (108 parts) was 68.7227 s\n",
- "Time taken to prepare 12 parts for XGB training 3.3915 s\n",
- "Time taken to train XGB model 87.521 s\n",
- "Total E2E time: 159.6352 s\n"
- ]
- }
- ],
- "source": [
- "if __name__ == \"__main__\":\n",
- "\n",
- " import cudf\n",
- " import xgboost as xgb\n",
- " import dask_cudf\n",
- "\n",
- " cmd = \"hostname --all-ip-addresses\"\n",
- " process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n",
- " output, error = process.communicate()\n",
- " IPADDR = str(output.decode()).split()[0]\n",
- "\n",
- " cluster = LocalCUDACluster(ip=IPADDR)\n",
- " client = Client(cluster)\n",
- "\n",
- " data_dir = os.environ.get(\"MORTGAGE_DATA_DIR\", \"\") # Default to current working directory\n",
- " res = client.run(memory_info)\n",
- " # Total GPU memory on the system\n",
- " total_mem = sum(res.values()) \n",
- " # Memory of a single GPU on the machine\n",
- " # If the machine has multiple GPUs of different sizes, this is the size of the smallest GPU\n",
- " min_mem = min(res.values()) \n",
- " \n",
- " # Start year for processing mortgage data\n",
- " start_year = None\n",
- " # End year for processing mortgage data\n",
- " end_year = None\n",
- " # The number of part files to train against. \n",
- " # If not provided, default to auto selection based on GPU memory available on the system\n",
- " part_count = os.environ.get(\"part_count\")\n",
- "\n",
- " start_year, end_year, part_count, use_1GB_splits = determine_dataset(\n",
- " total_mem=total_mem, min_mem=min_mem, part_count=part_count\n",
- " )\n",
- "\n",
- " # Download data based on these parameters\n",
- " # The 2GB split mortgage performance files are used if the system has 32GB GPUs.\n",
- " # On machines with GPUs less than 32GB we use the 1GB split files (to help reduce memory load)\n",
- " get_data(data_dir, start_year, end_year, use_1GB_splits)\n",
- "\n",
- " # Initialize a GPU pool allocating 90% of GPU memory for each worker\n",
- " client.run(rmm.reinitialize, pool_allocator=True, initial_pool_size=0.9 * min_mem)\n",
- " etl_result, etl_time = run_etl(start_year, end_year, data_dir, client)\n",
- "\n",
- " # Clear the existing RMM pool post-ETL to make space for GPU accelerated XGBoost\n",
- " # This makes space for XGBoost to operate since it doesn't have visibility into the cuDF memory pool\n",
- " client.run(rmm.reinitialize, pool_allocator=False)\n",
- "\n",
- " total_file_count = len(etl_result)\n",
- " etl_result = etl_result[:part_count] # Select subset for training\n",
- " model, dmat_time, train_time = xgb_training(etl_result, client)\n",
- "\n",
- " print(\n",
- " f\"\\nTime taken to run ETL from {start_year} to {end_year}\"\n",
- " f\" ({total_file_count} parts) was {round(etl_time,4)} s\"\n",
- " )\n",
- " print(\n",
- " f\"Time taken to prepare {len(etl_result)} parts\"\n",
- " f\" for XGB training {round(dmat_time,4)} s\"\n",
- " )\n",
- " print(f\"Time taken to train XGB model {round(train_time, 4)} s\")\n",
- " print(f\"Total E2E time: {round(etl_time+dmat_time+train_time, 4)} s\")\n",
- " client.close()\n",
- " cluster.close()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "rapids-14-may6",
- "language": "python",
- "name": "rapids-14-may6"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.6"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/E2E/mortgage/utils/Data_Spec.json b/intermediate_notebooks/E2E/mortgage/utils/Data_Spec.json
deleted file mode 100644
index d69e0463..00000000
--- a/intermediate_notebooks/E2E/mortgage/utils/Data_Spec.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
- "SpecInfo":
- [
- {
- "Total_Mem" : 511e9,
- "Start_Year" : 2000,
- "End_Year" : 2016,
- "Part_Count" : [48, 96]
- },
-
- {
- "Total_Mem" : 255e9,
- "Start_Year" : 2000,
- "End_Year" : 2016,
- "Part_Count" : [24, 48]
- },
-
- {
- "Total_Mem" : 127e9,
- "Start_Year" : 2000,
- "End_Year" : 2007,
- "Part_Count" : [16, 24]
- },
-
- {
- "Total_Mem" : 47e9,
- "Start_Year" : 2000,
- "End_Year" : 2004,
- "Part_Count" : [8, 12]
- },
-
- {
- "Total_Mem" : 15e9,
- "Start_Year" : 2000,
- "End_Year" : 2000,
- "Part_Count" : [2, 3]
- }
-
- ]
-}
\ No newline at end of file
diff --git a/intermediate_notebooks/E2E/mortgage/utils/utils.py b/intermediate_notebooks/E2E/mortgage/utils/utils.py
deleted file mode 100644
index a0745185..00000000
--- a/intermediate_notebooks/E2E/mortgage/utils/utils.py
+++ /dev/null
@@ -1,127 +0,0 @@
-from packaging import version
-import json
-import glob
-import multiprocessing
-import pynvml
-import os
-import tarfile
-import urllib
-
-# Global variables
-
-# Links to mortgage data files
-MORTGAGE_YEARLY_1GB_SPLITS_URL = "https://rapidsai-data.s3.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_yearly/"
-MORTGAGE_YEARLY_2GB_SPLITS_URL = "https://rapidsai-data.s3.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_yearly_2gb/"
-
-
-def get_data(data_dir, start_year, end_year, use_1GB_splits):
- """
-    Utility to download and extract mortgage data to the specified data_dir.
- Only specific years of data between `start_year` and `end_year` will be downloaded
- to the specified directory
- """
- if use_1GB_splits:
- data_url = MORTGAGE_YEARLY_1GB_SPLITS_URL
- else:
- data_url = MORTGAGE_YEARLY_2GB_SPLITS_URL
- for year in range(start_year, end_year + 1):
- if not os.path.isfile(data_dir + "acq/Acquisition_" + str(year) + "Q4.txt"):
- print(f"Downloading data for year {year}")
- filename = "mortgage_" + str(year)
- filename += "_1gb.tgz" if use_1GB_splits else "_2GB.tgz"
- urllib.request.urlretrieve(data_url + filename, data_dir + filename)
- print(f"Download complete")
- print(f"Decompressing and extracting data")
-
- tar = tarfile.open(data_dir + filename, mode="r:gz")
- tar.extractall(path=data_dir)
- tar.close()
- print(f"Done extracting year {year}")
-
- if not os.path.isfile(data_dir + "names.csv"):
- urllib.request.urlretrieve(data_url + "names.csv", data_dir + "names.csv")
-
-
-def _read_data_spec(filename=os.path.dirname(__file__) + "/Data_Spec.json"):
- """
- Read the Data_Spec json
- """
- with open(filename) as f:
- data_spec = json.load(f)
-
- try:
- spec_list = data_spec["SpecInfo"]
- except KeyError:
- raise ValueError(f"SpecInfo missing in Data spec file: {filename}")
- return spec_list
-
-
-def determine_dataset(total_mem, min_mem, part_count=None):
- """
- Determine params and dataset to use
- based on Data spec sheet and available memory
- """
-    start_year = None  # start year for ETL processing
- end_year = None # end year for etl processing (inclusive)
-
- use_1GB_splits = True
- if min_mem >= 31.5e9:
- use_1GB_splits = False
-
- spec_list = _read_data_spec()
- # Assumption that spec_list has elements with mem_requirement
- # in Descending order
-
- # TODO: Code duplication. Consolidate into one
- if part_count:
- part_count = int(part_count)
- for i, spec in enumerate(spec_list):
- spec_part_count = (
- spec["Part_Count"][1] if use_1GB_splits else spec["Part_Count"][0]
- )
- if part_count > spec_part_count:
- start_year = spec_list[i-1]["Start_Year"] if i>0 else spec["Start_Year"]
- end_year = spec_list[i-1]["End_Year"] if i>0 else spec["End_Year"]
- break
- if not start_year:
- start_year = spec_list[-1]["Start_Year"]
- end_year = spec_list[-1]["End_Year"]
-
- else:
- for spec in spec_list:
- spec_part_count = (
- spec["Part_Count"][1] if use_1GB_splits else spec["Part_Count"][0]
- )
- if total_mem >= spec["Total_Mem"]:
- start_year = spec["Start_Year"]
- end_year = spec["End_Year"]
- part_count = spec_part_count
- break
-
- return (start_year, end_year, part_count, use_1GB_splits)
-
-
-def memory_info():
- """
- Assumes identical GPUs in a node
- """
- pynvml.nvmlInit()
- handle = pynvml.nvmlDeviceGetHandleByIndex(0)
- gpu_mem = pynvml.nvmlDeviceGetMemoryInfo(handle).total
- pynvml.nvmlShutdown()
- return gpu_mem
-
-
-def get_num_files(start_year, end_year, perf_dir):
- """
- Get number of files to read given start_year
- end_year and path to performance files
- """
- count = 0
- for year in range(start_year, end_year + 1):
- count += len(glob.glob(perf_dir + f"/*{year}*"))
- return count
-
-
-def get_cpu_cores():
- return multiprocessing.cpu_count()
diff --git a/intermediate_notebooks/E2E/synthetic_3D/rapids_ml_workflow_demo.ipynb b/intermediate_notebooks/E2E/synthetic_3D/rapids_ml_workflow_demo.ipynb
deleted file mode 100644
index ebfb40cb..00000000
--- a/intermediate_notebooks/E2E/synthetic_3D/rapids_ml_workflow_demo.ipynb
+++ /dev/null
@@ -1,1513 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "
"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "fig = plt.figure(figsize=(100,50))\n",
- "plot_tree(xgBoostModelGPU, num_trees=0, ax=plt.subplot(1,1,1))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Visualize Class Predictions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "metadata": {},
- "outputs": [],
- "source": [
- "def map_colors_to_clusters_topK ( dataset, labels, topK=None, cmapName = 'tab10'):\n",
- " if topK == None:\n",
- " topK = dataset.shape[0]\n",
- " \n",
- " colorStack = np.zeros((topK, 3), dtype=np.float32)\n",
- " \n",
- " cMap = plt.get_cmap(cmapName)\n",
- " for iColor in range ( topK ):\n",
- " colorStack[iColor] = cMap.colors[ labels[iColor] ]\n",
- " \n",
- " return colorStack "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "metadata": {},
- "outputs": [],
- "source": [
-    "colorStackClassifier = map_colors_to_clusters_topK ( pd_X_test, yPredTestGPU.astype(int), topK=None )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_data( pd_X_test, colorStack= colorStackClassifier)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "-------\n",
- "# Extensions\n",
- "-------\n",
- "For extensions to this work visit github.com/miroenev/rapids"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "-----\n",
- "# End [ thanks! ]"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/README.md b/intermediate_notebooks/benchmarks/cugraph_benchmarks/README.md
deleted file mode 100644
index 80d5a3d5..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/README.md
+++ /dev/null
@@ -1,108 +0,0 @@
-# cuGraph Benchmarking
-
-This folder contains a collection of graph algorithm benchmarking notebooks. Each notebook will compare one cuGraph algorithm against the equivalent NetworkX version. In some cases, additional popular implementations are also tested.
-
-Before any benchmarking can be done, it is important to first download the test data sets.
-
-
-## Getting the Data Sets
-
-Run the data prep script.
-
-```bash
-sh ./dataPrep.sh
-```
-
-## Benchmarks
-
-1. Louvain
-2. PageRank
-3. BFS
-4. SSSP
-
-
-
-The benchmark does not include data reading time, but does include:
-
-- Creating the Graph object
-- Running the analytic
-
-
-
-
-
-
-#### The data prep script
-By default, each file is extracted into its own directory. The script below moves all the MTX files into a single directory.
-
-
-```bash
-#!/bin/bash
-
-mkdir data
-cd data
-mkdir tmp
-cd tmp
-
-wget https://sparse.tamu.edu/MM/DIMACS10/preferentialAttachment.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/caidaRouterLevel.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coAuthorsDBLP.tar.gz
-wget https://sparse.tamu.edu/MM/LAW/dblp-2010.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/citationCiteseer.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coPapersDBLP.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coPapersCiteseer.tar.gz
-wget https://sparse.tamu.edu/MM/SNAP/as-Skitter.tar.gz
-
-tar xvzf preferentialAttachment.tar.gz
-tar xvzf caidaRouterLevel.tar.gz
-tar xvzf coAuthorsDBLP.tar.gz
-tar xvzf dblp-2010.tar.gz
-tar xvzf citationCiteseer.tar.gz
-tar xvzf coPapersDBLP.tar.gz
-tar xvzf coPapersCiteseer.tar.gz
-tar xvzf as-Skitter.tar.gz
-
-cd ..
-
-find ./tmp -name '*.mtx' -exec mv {} . \;
-
-rm -rf tmp
-```
-
-
-
-**About the Test files**
-
-| File Name | Num of Vertices | Num of Edges | Format | Graph Type | Symmetric |
-| ---------------------- | --------------: | -----------: |--------|---------------------------|-------------|
-| preferentialAttachment | 100,000 | 999,970 | MTX | Random Undirected Graph | Yes |
-| caidaRouterLevel | 192,244 | 1,218,132 | MTX | Undirected Graph | Yes |
-| coAuthorsDBLP | 299,067 | 1,955,352 | MTX | Undirected Graph | Yes |
-| dblp-2010 | 326,186 | 1,615,400 | MTX | Undirected Graph | Yes |
-| citationCiteseer | 268,495 | 2,313,294 | MTX | Undirected Graph | Yes |
-| coPapersDBLP | 540,486 | 30,491,458 | MTX | Undirected Graph | Yes |
-| coPapersCiteseer | 434,102 | 32,073,440 | MTX | Undirected Graph | Yes |
-| as-Skitter | 1,696,415 | 22,190,596 | MTX | Undirected Graph | Yes |
-
-
-
-### Dataset Acknowledgments
-
-The datasets are downloaded from the Texas A&M SuiteSparse Matrix Collection.
-
-```
-The SuiteSparse Matrix Collection (formerly known as the University of Florida Sparse Matrix Collection), is a large and actively growing set of sparse matrices that arise in real applications.
-...
-The Collection is hosted here, and also mirrored at the University of Florida at www.cise.ufl.edu/research/sparse/matrices. The Collection is maintained by Tim Davis, Texas A&M University (email: davis@tamu.edu), Yifan Hu, Yahoo! Labs, and Scott Kolodziej, Texas A&M University.
-```
-
-| File Name | Author |
-| ---------------------- |----------------|
-| preferentialAttachment | H. Meyerhenke |
-| caidaRouterLevel | Unknown |
-| coAuthorsDBLP | R. Geisberger, P. Sanders, and D. Schultes |
-| dblp-2010 | Laboratory for Web Algorithmics (LAW) |
-| citationCiteseer | R. Geisberger, P. Sanders, and D. Schultes |
-| coPapersDBLP | R. Geisberger, P. Sanders, and D. Schultes |
-| coPapersCiteseer | R. Geisberger, P. Sanders, and D. Schultes |
-| as-Skitter | J. Leskovec, J. Kleinberg and C. Faloutsos |
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/bfs_benchmark.ipynb b/intermediate_notebooks/benchmarks/cugraph_benchmarks/bfs_benchmark.ipynb
deleted file mode 100644
index 366c6b65..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/bfs_benchmark.ipynb
+++ /dev/null
@@ -1,329 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# BFS Performance Benchmarking\n",
- "\n",
- "This notebook benchmarks performance of running BFS within cuGraph against NetworkX. \n",
- "\n",
- "Notebook Credits\n",
- "\n",
- " Original Authors: Bradley Rees\n",
- " Last Edit: 10/30/2019\n",
- " \n",
- "RAPIDS Versions: 0.10.0\n",
- "\n",
- "Test Hardware\n",
- "\n",
-    "    GV100 32G, CUDA 10.0\n",
- " Intel(R) Core(TM) CPU i7-7800X @ 3.50GHz\n",
- " 32GB system memory\n",
- " \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Test Data\n",
- "\n",
- "| File Name | Num of Vertices | Num of Edges |\n",
- "|:---------------------- | --------------: | -----------: |\n",
- "| preferentialAttachment | 100,000 | 999,970 |\n",
- "| caidaRouterLevel | 192,244 | 1,218,132 |\n",
- "| coAuthorsDBLP | 299,067 | 1,955,352 |\n",
- "| dblp-2010 | 326,186 | 1,615,400 |\n",
- "| citationCiteseer | 268,495 | 2,313,294 |\n",
- "| coPapersDBLP | 540,486 | 30,491,458 |\n",
- "| coPapersCiteseer | 434,102 | 32,073,440 |\n",
- "| as-Skitter | 1,696,415 | 22,190,596 |\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Timing \n",
- "What is not timed: Reading the data\n",
-    "What is timed: (1) creating a Graph, (2) running BFS\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## NOTICE:\n",
- "You must have run the dataPrep script prior to running this notebook so that the data is downloaded\n",
- "\n",
-    "See the README file in this folder for a description of how to get the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import needed libraries\n",
- "import gc\n",
- "import time\n",
- "import rmm\n",
- "import cugraph\n",
- "import cudf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# NetworkX libraries\n",
- "import networkx as nx\n",
- "from scipy.io import mmread"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt; plt.rcdefaults()\n",
- "import numpy as np"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Get Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!bash dataPrep.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the test data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Test File\n",
- "data = {\n",
- " 'preferentialAttachment' : './data/preferentialAttachment.mtx',\n",
- " 'caidaRouterLevel' : './data/caidaRouterLevel.mtx',\n",
- " 'coAuthorsDBLP' : './data/coAuthorsDBLP.mtx',\n",
- " 'dblp' : './data/dblp-2010.mtx',\n",
- " 'citationCiteseer' : './data/citationCiteseer.mtx',\n",
- " 'coPapersDBLP' : './data/coPapersDBLP.mtx',\n",
- " 'coPapersCiteseer' : './data/coPapersCiteseer.mtx',\n",
- " 'as-Skitter' : './data/as-Skitter.mtx'\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the testing functions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Data reader - the file format is MTX, so we will use the reader from SciPy\n",
- "def read_mtx_file(mm_file):\n",
- " print('Reading ' + str(mm_file) + '...')\n",
- " M = mmread(mm_file).asfptype()\n",
- " \n",
- " return M"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "# CuGraph BFS\n",
- "\n",
- "def cugraph_call(M):\n",
- "\n",
- " gdf = cudf.DataFrame()\n",
- " gdf['src'] = M.row\n",
- " gdf['dst'] = M.col\n",
- " \n",
- " print('\\tcuGraph Solving... ')\n",
- " \n",
- " t1 = time.time()\n",
- " \n",
-    "    # Build the graph and run cuGraph BFS from vertex 1\n",
- " G = cugraph.Graph()\n",
- " G.from_cudf_edgelist(gdf, source='src', destination='dst')\n",
- " \n",
- " df = cugraph.bfs(G, 1)\n",
- " t2 = time.time() - t1\n",
- " \n",
- " return t2\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Basic NetworkX BFS\n",
- "\n",
- "def networkx_call(M):\n",
- " nnz_per_row = {r: 0 for r in range(M.get_shape()[0])}\n",
- " for nnz in range(M.getnnz()):\n",
- " nnz_per_row[M.row[nnz]] = 1 + nnz_per_row[M.row[nnz]]\n",
- " for nnz in range(M.getnnz()):\n",
- " M.data[nnz] = 1.0/float(nnz_per_row[M.row[nnz]])\n",
- "\n",
- " M = M.tocsr()\n",
- " if M is None:\n",
- " raise TypeError('Could not read the input graph')\n",
- " if M.shape[0] != M.shape[1]:\n",
- " raise TypeError('Shape is not square')\n",
- "\n",
- " # should be autosorted, but check just to make sure\n",
- " if not M.has_sorted_indices:\n",
- " print('sort_indices ... ')\n",
- " M.sort_indices()\n",
- "\n",
- " z = {k: 1.0/M.shape[0] for k in range(M.shape[0])}\n",
- " \n",
- " print('\\tNetworkX Solving... ')\n",
- " \n",
- " # start timer\n",
- " t1 = time.time()\n",
- " \n",
- " Gnx = nx.DiGraph(M)\n",
- "\n",
-    "    pr = list(nx.bfs_edges(Gnx, 1))  # bfs_edges is lazy; consume it so the traversal is actually timed\n",
- " \n",
- " t2 = time.time() - t1\n",
- "\n",
- " return t2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Run the benchmarks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# arrays to capture performance gains\n",
- "perf_nx = []\n",
- "names = []\n",
- "\n",
- "for k,v in data.items():\n",
- " gc.collect()\n",
- "\n",
- " rmm.reinitialize(\n",
- " managed_memory=False,\n",
- " pool_allocator=False,\n",
- " initial_pool_size=2 << 27\n",
- " ) \n",
- " \n",
-    "    # Save the file name\n",
- " names.append(k)\n",
- " \n",
- " # read the data\n",
- " M = read_mtx_file(v)\n",
- " \n",
- " \n",
- " # call cuGraph - this will be the baseline\n",
- " trapids = cugraph_call(M)\n",
- " \n",
- " # Now call NetworkX\n",
- " tn = networkx_call(M)\n",
- " speedUp = (tn / trapids)\n",
- " perf_nx.append(speedUp)\n",
- " \n",
- " print(\"\\tcuGraph (\" + str(trapids) + \") Nx (\" + str(tn) + \")\" )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "\n",
- "plt.figure(figsize=(10,8))\n",
- "\n",
- "bar_width = 0.4\n",
- "index = np.arange(len(names))\n",
- "\n",
- "_ = plt.bar(index, perf_nx, bar_width, color='g', label='vs Nx')\n",
- "\n",
- "plt.xlabel('Datasets')\n",
- "plt.ylabel('Speedup')\n",
- "plt.title('BFS Performance Speedup')\n",
- "plt.xticks(index + (bar_width / 2), names)\n",
- "plt.xticks(rotation=90) \n",
- "\n",
- "# Text on the top of each barplot\n",
- "for i in range(len(perf_nx)):\n",
- " plt.text(x = (i - .5) + bar_width, y = perf_nx[i] + 25, s = round(perf_nx[i], 1), size = 12)\n",
- "\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/dataPrep.sh b/intermediate_notebooks/benchmarks/cugraph_benchmarks/dataPrep.sh
deleted file mode 100755
index 34b15efa..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/dataPrep.sh
+++ /dev/null
@@ -1,30 +0,0 @@
-#!/bin/bash
-
-mkdir data
-cd data
-mkdir tmp
-cd tmp
-
-wget https://sparse.tamu.edu/MM/DIMACS10/preferentialAttachment.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/caidaRouterLevel.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coAuthorsDBLP.tar.gz
-wget https://sparse.tamu.edu/MM/LAW/dblp-2010.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/citationCiteseer.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coPapersDBLP.tar.gz
-wget https://sparse.tamu.edu/MM/DIMACS10/coPapersCiteseer.tar.gz
-wget https://sparse.tamu.edu/MM/SNAP/as-Skitter.tar.gz
-
-tar xvzf preferentialAttachment.tar.gz
-tar xvzf caidaRouterLevel.tar.gz
-tar xvzf coAuthorsDBLP.tar.gz
-tar xvzf dblp-2010.tar.gz
-tar xvzf citationCiteseer.tar.gz
-tar xvzf coPapersDBLP.tar.gz
-tar xvzf coPapersCiteseer.tar.gz
-tar xvzf as-Skitter.tar.gz
-
-cd ..
-
-find ./tmp -name '*.mtx' -exec mv {} . \;
-
-rm -rf tmp
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/louvain_benchmark.ipynb b/intermediate_notebooks/benchmarks/cugraph_benchmarks/louvain_benchmark.ipynb
deleted file mode 100644
index 2a860e35..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/louvain_benchmark.ipynb
+++ /dev/null
@@ -1,449 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Louvain Performance Benchmarking\n",
- "\n",
- "This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX. The test is run over eight test networks (graphs) and then results plotted. \n",
-    "\n",
- "\n",
- "\n",
- "#### Notebook Credits\n",
- "\n",
- " Original Authors: Bradley Rees\n",
- " Last Edit: 08/06/2019\n",
- "\n",
- "\n",
- "#### Test Environment\n",
- "\n",
- " RAPIDS Versions: 0.9.0\n",
- "\n",
- " Test Hardware:\n",
-    "        GV100 32G, CUDA 10.0\n",
- " Intel(R) Core(TM) CPU i7-7800X @ 3.50GHz\n",
- " 32GB system memory\n",
- "\n",
- "\n",
- "\n",
- "#### Updates\n",
-    "- moved loading of plotting libraries to the front so that dependencies can be checked before running algorithms\n",
-    "- added edge values \n",
-    "- changed timing to include Graph creation for both cuGraph and NetworkX. This better represents end-to-end times\n",
- "\n",
- "\n",
- "\n",
- "#### Dependencies\n",
- "- RAPIDS cuDF and cuGraph version 0.6.0 \n",
- "- NetworkX \n",
- "- Matplotlib \n",
- "- Scipy \n",
- "- data prep script run\n",
- "\n",
- "\n",
- "\n",
- "#### Note: Comparison against published results\n",
- "\n",
- "\n",
-    "The cuGraph blog post included performance numbers that were collected over a year ago. For the test graphs, int32 values are now used, which improves GPU performance. Additionally, the initial benchmarks were measured on a P100 GPU.\n",
- "\n",
-    "This test only compares the modularity scores; it counts as a success if the scores are within 15% of each other. The comparison is done by adjusting the NetworkX modularity score and then verifying that the cuGraph score is higher.\n",
- "\n",
-    "A full validation of NetworkX results against cuGraph results, including cross-validation of every cluster, was done separately. That test is very slow and is not included here."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import needed libraries\n",
- "import time\n",
- "import cugraph\n",
- "import cudf\n",
- "import os"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "# NetworkX libraries\n",
- "try: \n",
- " import community\n",
- "except ModuleNotFoundError:\n",
- " os.system('pip install python-louvain')\n",
- " import community\n",
- "import networkx as nx\n",
- "from scipy.io import mmread"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Loading plotting libraries\n",
- "import matplotlib.pyplot as plt; plt.rcdefaults()\n",
- "import numpy as np\n",
- "import matplotlib.pyplot as plt"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "mkdir: cannot create directory 'data': File exists\n",
- "--2019-11-01 20:49:03-- https://sparse.tamu.edu/MM/DIMACS10/preferentialAttachment.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 2027782 (1.9M) [application/x-gzip]\n",
- "Saving to: 'preferentialAttachment.tar.gz'\n",
- "\n",
- "preferentialAttachm 100%[===================>] 1.93M 3.48MB/s in 0.6s \n",
- "\n",
- "2019-11-01 20:49:04 (3.48 MB/s) - 'preferentialAttachment.tar.gz' saved [2027782/2027782]\n",
- "\n",
- "--2019-11-01 20:49:04-- https://sparse.tamu.edu/MM/DIMACS10/caidaRouterLevel.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 2418742 (2.3M) [application/x-gzip]\n",
- "Saving to: 'caidaRouterLevel.tar.gz'\n",
- "\n",
- "caidaRouterLevel.ta 100%[===================>] 2.31M 3.76MB/s in 0.6s \n",
- "\n",
- "2019-11-01 20:49:05 (3.76 MB/s) - 'caidaRouterLevel.tar.gz' saved [2418742/2418742]\n",
- "\n",
- "--2019-11-01 20:49:05-- https://sparse.tamu.edu/MM/DIMACS10/coAuthorsDBLP.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 3206075 (3.1M) [application/x-gzip]\n",
- "Saving to: 'coAuthorsDBLP.tar.gz'\n",
- "\n",
- "coAuthorsDBLP.tar.g 100%[===================>] 3.06M 3.99MB/s in 0.8s \n",
- "\n",
- "2019-11-01 20:49:06 (3.99 MB/s) - 'coAuthorsDBLP.tar.gz' saved [3206075/3206075]\n",
- "\n",
- "--2019-11-01 20:49:06-- https://sparse.tamu.edu/MM/LAW/dblp-2010.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 2235407 (2.1M) [application/x-gzip]\n",
- "Saving to: 'dblp-2010.tar.gz'\n",
- "\n",
- "dblp-2010.tar.gz 100%[===================>] 2.13M 3.75MB/s in 0.6s \n",
- "\n",
- "2019-11-01 20:49:07 (3.75 MB/s) - 'dblp-2010.tar.gz' saved [2235407/2235407]\n",
- "\n",
- "--2019-11-01 20:49:07-- https://sparse.tamu.edu/MM/DIMACS10/citationCiteseer.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 5082095 (4.8M) [application/x-gzip]\n",
- "Saving to: 'citationCiteseer.tar.gz'\n",
- "\n",
- "citationCiteseer.ta 100%[===================>] 4.85M 4.23MB/s in 1.1s \n",
- "\n",
- "2019-11-01 20:49:08 (4.23 MB/s) - 'citationCiteseer.tar.gz' saved [5082095/5082095]\n",
- "\n",
- "--2019-11-01 20:49:08-- https://sparse.tamu.edu/MM/DIMACS10/coPapersDBLP.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 36298718 (35M) [application/x-gzip]\n",
- "Saving to: 'coPapersDBLP.tar.gz'\n",
- "\n",
- "coPapersDBLP.tar.gz 100%[===================>] 34.62M 4.93MB/s in 7.2s \n",
- "\n",
- "2019-11-01 20:49:16 (4.79 MB/s) - 'coPapersDBLP.tar.gz' saved [36298718/36298718]\n",
- "\n",
- "--2019-11-01 20:49:16-- https://sparse.tamu.edu/MM/DIMACS10/coPapersCiteseer.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 36652888 (35M) [application/x-gzip]\n",
- "Saving to: 'coPapersCiteseer.tar.gz'\n",
- "\n",
- "coPapersCiteseer.ta 100%[===================>] 34.95M 4.93MB/s in 7.2s \n",
- "\n",
- "2019-11-01 20:49:23 (4.82 MB/s) - 'coPapersCiteseer.tar.gz' saved [36652888/36652888]\n",
- "\n",
- "--2019-11-01 20:49:23-- https://sparse.tamu.edu/MM/SNAP/as-Skitter.tar.gz\n",
- "Resolving sparse.tamu.edu (sparse.tamu.edu)... 128.194.136.136\n",
- "Connecting to sparse.tamu.edu (sparse.tamu.edu)|128.194.136.136|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 33172905 (32M) [application/x-gzip]\n",
- "Saving to: 'as-Skitter.tar.gz'\n",
- "\n",
- "as-Skitter.tar.gz 100%[===================>] 31.64M 4.92MB/s in 6.6s \n",
- "\n",
- "2019-11-01 20:49:30 (4.79 MB/s) - 'as-Skitter.tar.gz' saved [33172905/33172905]\n",
- "\n",
- "preferentialAttachment/preferentialAttachment.mtx\n",
- "caidaRouterLevel/caidaRouterLevel.mtx\n",
- "coAuthorsDBLP/coAuthorsDBLP.mtx\n",
- "dblp-2010/dblp-2010.mtx\n",
- "citationCiteseer/citationCiteseer.mtx\n",
- "coPapersDBLP/coPapersDBLP.mtx\n",
- "coPapersCiteseer/coPapersCiteseer.mtx\n",
- "as-Skitter/as-Skitter.mtx\n",
- "find: paths must precede expression: caidaRouterLevel.mtx\n",
- "Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec|time] [path...] [expression]\n"
- ]
- }
- ],
- "source": [
- "!bash dataPrep.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the test data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Test File\n",
- "data = {\n",
- " 'preferentialAttachment' : './data/preferentialAttachment.mtx',\n",
- " 'caidaRouterLevel' : './data/caidaRouterLevel.mtx',\n",
- " 'coAuthorsDBLP' : './data/coAuthorsDBLP.mtx',\n",
- " 'dblp' : './data/dblp-2010.mtx',\n",
- " 'citationCiteseer' : './data/citationCiteseer.mtx',\n",
- " 'coPapersDBLP' : './data/coPapersDBLP.mtx',\n",
- " 'coPapersCiteseer' : './data/coPapersCiteseer.mtx',\n",
- " 'as-Skitter' : './data/as-Skitter.mtx'\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the testing functions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Read in a dataset in MTX format \n",
- "def read_mtx_file(mm_file):\n",
- " print('Reading ' + str(mm_file) + '...')\n",
- " d = mmread(mm_file).asfptype()\n",
- " M = d.tocsr()\n",
- " \n",
- " if M is None:\n",
- " raise TypeError('Could not read the input graph')\n",
- " if M.shape[0] != M.shape[1]:\n",
- " raise TypeError('Shape is not square')\n",
- " \n",
- " return M"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Run the cuGraph Louvain analytic (using nvGRAPH function)\n",
- "def cugraph_call(M):\n",
- "\n",
- " t1 = time.time()\n",
- "\n",
- " # data\n",
- " row_offsets = cudf.Series(M.indptr)\n",
- " col_indices = cudf.Series(M.indices)\n",
- " data = cudf.Series(M.data)\n",
- " \n",
- " # create graph \n",
- " G = cugraph.Graph()\n",
- " G.add_adj_list(row_offsets, col_indices, data)\n",
- "\n",
- " # cugraph Louvain Call\n",
- " print(' cuGraph Solving... ')\n",
- " df, mod = cugraph.louvain(G) \n",
- " \n",
- " t2 = time.time() - t1\n",
- " return t2, mod\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Run the NetworkX Louvain analytic. This is done in two parts since the modularity score is not returned \n",
- "def networkx_call(M):\n",
- " \n",
- " t1 = time.time()\n",
- "\n",
- " # Undirected NetworkX graph (nx.Graph)\n",
- " Gnx = nx.Graph(M)\n",
- "\n",
- " # Networkx \n",
- " print(' NetworkX Solving... ')\n",
- " parts = community.best_partition(Gnx)\n",
- " \n",
- " # Calculating modularity scores for comparison \n",
- " mod = community.modularity(parts, Gnx) \n",
- " \n",
- " t2 = time.time() - t1\n",
- " \n",
- " return t2, mod"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Run the benchmarks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Reading ./data/preferentialAttachment.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n",
- "3509.4500202625027x faster => cugraph 0.8648371696472168 vs 3035.1028225421906\n",
- "Modularity => cugraph 0.19461682219817675 should be greater than 0.21973558127621454\n",
- "Reading ./data/caidaRouterLevel.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n",
- "7076.7607431556x faster => cugraph 0.04834103584289551 vs 342.0979447364807\n",
- "Modularity => cugraph 0.7872923202092253 should be greater than 0.7289947349239256\n",
- "Reading ./data/coAuthorsDBLP.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n",
- "11893.139026724633x faster => cugraph 0.06750750541687012 vs 802.8761472702026\n",
- "Modularity => cugraph 0.7648739273488195 should be greater than 0.7026254024456955\n",
- "Reading ./data/dblp-2010.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n",
- "12969.744546806074x faster => cugraph 0.07826042175292969 vs 1015.0176782608032\n",
- "Modularity => cugraph 0.7506256512679915 should be greater than 0.7450002914515801\n",
- "Reading ./data/citationCiteseer.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n",
- "16875.667838933237x faster => cugraph 0.07159066200256348 vs 1208.1402323246002\n",
- "Modularity => cugraph 0.6726575224227932 should be greater than 0.6845554405196591\n",
- "Reading ./data/coPapersDBLP.mtx...\n",
- " cuGraph Solving... \n",
- " NetworkX Solving... \n"
- ]
- }
- ],
- "source": [
- "# Loop through each test file and compute the speedup\n",
- "perf = []\n",
- "names = []\n",
- "\n",
- "for k,v in data.items():\n",
- " M = read_mtx_file(v)\n",
- " tr, modc = cugraph_call(M)\n",
- " tn, modx = networkx_call(M)\n",
- " \n",
- " speedUp = (tn / tr)\n",
- " names.append(k)\n",
- " perf.append(speedUp)\n",
- " \n",
- " mod_delta = (0.85 * modx)\n",
- " \n",
- " print(str(speedUp) + \"x faster => cugraph \" + str(tr) + \" vs \" + str(tn))\n",
- " print(\"Modularity => cugraph \" + str(modc) + \" should be greater than \" + str(mod_delta))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### plot the output"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "\n",
- "y_pos = np.arange(len(names))\n",
- " \n",
- "plt.bar(y_pos, perf, align='center', alpha=0.5)\n",
- "plt.xticks(y_pos, names)\n",
- "plt.ylabel('Speed Up')\n",
- "plt.title('Performance Speedup: cuGraph vs NetworkX')\n",
- "plt.xticks(rotation=90) \n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/pagerank_benchmark.ipynb b/intermediate_notebooks/benchmarks/cugraph_benchmarks/pagerank_benchmark.ipynb
deleted file mode 100644
index 3697fcce..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/pagerank_benchmark.ipynb
+++ /dev/null
@@ -1,398 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# PageRank Performance Benchmarking\n",
- "\n",
- "This notebook benchmarks performance of running PageRank within cuGraph against NetworkX. NetworkX contains several implementations of PageRank. This benchmark will compare cuGraph versus the default Nx implementation as well as the SciPy version.\n",
- "\n",
- "Notebook Credits\n",
- "\n",
- " Original Authors: Bradley Rees\n",
- " Last Edit: 12/23/2019\n",
- " \n",
- "RAPIDS Versions: 0.12.0\n",
- "\n",
- "Test Hardware\n",
- "\n",
- " GV100 32G, CUDA 10.0\n",
- " Intel(R) Core(TM) CPU i7-7800X @ 3.50GHz\n",
- " 32GB system memory\n",
- " \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Test Data\n",
- "\n",
- "| File Name | Num of Vertices | Num of Edges |\n",
- "|:---------------------- | --------------: | -----------: |\n",
- "| preferentialAttachment | 100,000 | 999,970 |\n",
- "| caidaRouterLevel | 192,244 | 1,218,132 |\n",
- "| coAuthorsDBLP | 299,067 | 1,955,352 |\n",
- "| dblp-2010 | 326,186 | 1,615,400 |\n",
- "| citationCiteseer | 268,495 | 2,313,294 |\n",
- "| coPapersDBLP | 540,486 | 30,491,458 |\n",
- "| coPapersCiteseer | 434,102 | 32,073,440 |\n",
- "| as-Skitter | 1,696,415 | 22,190,596 |\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Timing \n",
- "What is not timed: Reading the data\n",
- "What is timed: (1) creating a Graph, (2) running PageRank\n",
- "\n",
- "The data file is read in once for all flavors of PageRank. Each timed block will create a Graph and then execute the algorithm. The results of the algorithm are not compared. If you are interested in seeing the comparison of results, then please see PageRank in the __notebooks__ repo. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## NOTICE\n",
- "You must have run the dataPrep script prior to running this notebook so that the data is downloaded\n",
- "\n",
- "See the README file in this folder for a description of how to get the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import needed libraries\n",
- "import gc\n",
- "import time\n",
- "import rmm\n",
- "import cugraph\n",
- "import cudf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# NetworkX libraries\n",
- "import networkx as nx\n",
- "from scipy.io import mmread"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt; plt.rcdefaults()\n",
- "import numpy as np"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Get Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!bash dataPrep.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the test data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Test File\n",
- "data = {\n",
- " 'preferentialAttachment' : './data/preferentialAttachment.mtx',\n",
- " 'caidaRouterLevel' : './data/caidaRouterLevel.mtx',\n",
- " 'coAuthorsDBLP' : './data/coAuthorsDBLP.mtx',\n",
- " 'dblp' : './data/dblp-2010.mtx',\n",
- " 'citationCiteseer' : './data/citationCiteseer.mtx',\n",
- " 'coPapersDBLP' : './data/coPapersDBLP.mtx',\n",
- " 'coPapersCiteseer' : './data/coPapersCiteseer.mtx',\n",
- " 'as-Skitter' : './data/as-Skitter.mtx'\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the testing functions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Data reader - the file format is MTX, so we will use the reader from SciPy\n",
- "def read_mtx_file(mm_file):\n",
- " print('Reading ' + str(mm_file) + '...')\n",
- " M = mmread(mm_file).asfptype()\n",
- " \n",
- " return M"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# CuGraph PageRank\n",
- "\n",
- "def cugraph_call(M, max_iter, tol, alpha):\n",
- "\n",
- " gdf = cudf.DataFrame()\n",
- " gdf['src'] = M.row\n",
- " gdf['dst'] = M.col\n",
- " \n",
- " print('\\tcuGraph Solving... ')\n",
- " \n",
- " t1 = time.time()\n",
- " \n",
- " # cugraph Pagerank Call\n",
- " G = cugraph.Graph()\n",
- " G.from_cudf_edgelist(gdf, source='src', destination='dst')\n",
- " \n",
- " df = cugraph.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)\n",
- " t2 = time.time() - t1\n",
- " \n",
- " return t2\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Basic NetworkX PageRank\n",
- "\n",
- "def networkx_call(M, max_iter, tol, alpha):\n",
- " nnz_per_row = {r: 0 for r in range(M.get_shape()[0])}\n",
- " for nnz in range(M.getnnz()):\n",
- " nnz_per_row[M.row[nnz]] = 1 + nnz_per_row[M.row[nnz]]\n",
- " for nnz in range(M.getnnz()):\n",
- " M.data[nnz] = 1.0/float(nnz_per_row[M.row[nnz]])\n",
- "\n",
- " M = M.tocsr()\n",
- " if M is None:\n",
- " raise TypeError('Could not read the input graph')\n",
- " if M.shape[0] != M.shape[1]:\n",
- " raise TypeError('Shape is not square')\n",
- "\n",
- " # should be autosorted, but check just to make sure\n",
- " if not M.has_sorted_indices:\n",
- " print('sort_indices ... ')\n",
- " M.sort_indices()\n",
- "\n",
- " z = {k: 1.0/M.shape[0] for k in range(M.shape[0])}\n",
- " \n",
- " print('\\tNetworkX Solving... ')\n",
- " \n",
- " # start timer\n",
- " t1 = time.time()\n",
- " \n",
- " Gnx = nx.DiGraph(M)\n",
- "\n",
- " pr = nx.pagerank(Gnx, alpha, z, max_iter, tol)\n",
- " \n",
- " t2 = time.time() - t1\n",
- "\n",
- " return t2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# SciPy PageRank\n",
- "\n",
- "def networkx_scipy_call(M, max_iter, tol, alpha):\n",
- " nnz_per_row = {r: 0 for r in range(M.get_shape()[0])}\n",
- " for nnz in range(M.getnnz()):\n",
- " nnz_per_row[M.row[nnz]] = 1 + nnz_per_row[M.row[nnz]]\n",
- " for nnz in range(M.getnnz()):\n",
- " M.data[nnz] = 1.0/float(nnz_per_row[M.row[nnz]])\n",
- "\n",
- " M = M.tocsr()\n",
- " if M is None:\n",
- " raise TypeError('Could not read the input graph')\n",
- " if M.shape[0] != M.shape[1]:\n",
- " raise TypeError('Shape is not square')\n",
- "\n",
- " # should be autosorted, but check just to make sure\n",
- " if not M.has_sorted_indices:\n",
- " print('sort_indices ... ')\n",
- " M.sort_indices()\n",
- "\n",
- " z = {k: 1.0/M.shape[0] for k in range(M.shape[0])}\n",
- "\n",
- " # SciPy Pagerank Call\n",
- " print('\\tSciPy Solving... ')\n",
- " t1 = time.time()\n",
- " \n",
- " Gnx = nx.DiGraph(M) \n",
- " \n",
- " pr = nx.pagerank_scipy(Gnx, alpha, z, max_iter, tol)\n",
- " t2 = time.time() - t1\n",
- "\n",
- " return t2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Run the benchmarks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# arrays to capture performance gains\n",
- "perf_nx = []\n",
- "perf_sp = []\n",
- "names = []\n",
- "\n",
- "for k,v in data.items():\n",
- " gc.collect()\n",
- "\n",
- " rmm.reinitialize(\n",
- " managed_memory=False,\n",
- " pool_allocator=False,\n",
- " initial_pool_size=2 << 27\n",
- " )\n",
- " \n",
- " # Save the file name\n",
- " names.append(k)\n",
- " \n",
- " # read the data\n",
- " M = read_mtx_file(v)\n",
- " \n",
- " # call cuGraph - this will be the baseline\n",
- " trapids = cugraph_call(M, 100, 0.00001, 0.85)\n",
- " \n",
- " # Now call NetworkX\n",
- " tn = networkx_call(M, 100, 0.00001, 0.85)\n",
- " speedUp = (tn / trapids)\n",
- " perf_nx.append(speedUp)\n",
- " \n",
- " # Now call SciPy\n",
- " tsp = networkx_scipy_call(M, 100, 0.00001, 0.85)\n",
- " speedUp = (tsp / trapids)\n",
- " perf_sp.append(speedUp) \n",
- " \n",
- " print(\"cuGraph (\" + str(trapids) + \") Nx (\" + str(tn) + \") SciPy (\" + str(tsp) + \")\" )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "### plot the output"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
-
- "%matplotlib inline\n",
- "\n",
- "plt.figure(figsize=(10,8))\n",
- "\n",
- "bar_width = 0.35\n",
- "index = np.arange(len(names))\n",
- "\n",
- "_ = plt.bar(index, perf_nx, bar_width, color='g', label='vs Nx')\n",
- "_ = plt.bar(index + bar_width, perf_sp, bar_width, color='b', label='vs SciPy')\n",
- "\n",
- "plt.xlabel('Datasets')\n",
- "plt.ylabel('Speedup')\n",
- "plt.title('PageRank Performance Speedup')\n",
- "plt.xticks(index + (bar_width / 2), names)\n",
- "plt.xticks(rotation=90) \n",
- "\n",
- "# Text on the top of each barplot\n",
- "for i in range(len(perf_nx)):\n",
- " plt.text(x = (i - 0.55) + bar_width, y = perf_nx[i] + 25, s = round(perf_nx[i], 1), size = 12)\n",
- "\n",
- "for i in range(len(perf_sp)):\n",
- " plt.text(x = (i - 0.1) + bar_width, y = perf_sp[i] + 25, s = round(perf_sp[i], 1), size = 12)\n",
- "\n",
- "\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/benchmarks/cugraph_benchmarks/sssp_benchmark.ipynb b/intermediate_notebooks/benchmarks/cugraph_benchmarks/sssp_benchmark.ipynb
deleted file mode 100644
index 170c72c0..00000000
--- a/intermediate_notebooks/benchmarks/cugraph_benchmarks/sssp_benchmark.ipynb
+++ /dev/null
@@ -1,331 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# SSSP Performance Benchmarking\n",
- "\n",
- "This notebook benchmarks performance of running SSSP within cuGraph against NetworkX. \n",
- "\n",
- "Notebook Credits\n",
- "\n",
- " Original Authors: Bradley Rees\n",
- " Last Edit: 12/24/2019\n",
- " \n",
- "RAPIDS Versions: 0.12.0\n",
- "\n",
- "Test Hardware\n",
- "\n",
- " GV100 32G, CUDA 10.0\n",
- " Intel(R) Core(TM) CPU i7-7800X @ 3.50GHz\n",
- " 32GB system memory\n",
- " \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Test Data\n",
- "\n",
- "| File Name | Num of Vertices | Num of Edges |\n",
- "|:---------------------- | --------------: | -----------: |\n",
- "| preferentialAttachment | 100,000 | 999,970 |\n",
- "| caidaRouterLevel | 192,244 | 1,218,132 |\n",
- "| coAuthorsDBLP | 299,067 | 1,955,352 |\n",
- "| dblp-2010 | 326,186 | 1,615,400 |\n",
- "| citationCiteseer | 268,495 | 2,313,294 |\n",
- "| coPapersDBLP | 540,486 | 30,491,458 |\n",
- "| coPapersCiteseer | 434,102 | 32,073,440 |\n",
- "| as-Skitter | 1,696,415 | 22,190,596 |\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Timing \n",
- "What is not timed: Reading the data\n",
- "What is timed: (1) creating a Graph, (2) running SSSP\n",
- "\n",
- "The data file is read and used for both cuGraph and NetworkX. Each timed block will create a Graph and then execute the algorithm. The results of the algorithm are not compared. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## NOTICE\n",
- "You must have run the dataPrep script prior to running this notebook so that the data is downloaded\n",
- "\n",
- "See the README file in this folder for a description of how to get the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Import needed libraries\n",
- "import gc\n",
- "import time\n",
- "import rmm\n",
- "import cugraph\n",
- "import cudf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# NetworkX libraries\n",
- "import networkx as nx\n",
- "from scipy.io import mmread"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt; plt.rcdefaults()\n",
- "import numpy as np"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Get Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!bash dataPrep.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the test data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Test File\n",
- "data = {\n",
- " 'preferentialAttachment' : './data/preferentialAttachment.mtx',\n",
- " 'caidaRouterLevel' : './data/caidaRouterLevel.mtx',\n",
- " 'coAuthorsDBLP' : './data/coAuthorsDBLP.mtx',\n",
- " 'dblp' : './data/dblp-2010.mtx',\n",
- " 'citationCiteseer' : './data/citationCiteseer.mtx',\n",
- " 'coPapersDBLP' : './data/coPapersDBLP.mtx',\n",
- " 'coPapersCiteseer' : './data/coPapersCiteseer.mtx',\n",
- " 'as-Skitter' : './data/as-Skitter.mtx'\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Define the testing functions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Data reader - the file format is MTX, so we will use the reader from SciPy\n",
- "def read_mtx_file(mm_file):\n",
- " print('Reading ' + str(mm_file) + '...')\n",
- " M = mmread(mm_file).asfptype()\n",
- " \n",
- " return M"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# CuGraph SSSP\n",
- "\n",
- "def cugraph_call(M, max_iter, tol, alpha):\n",
- "\n",
- " gdf = cudf.DataFrame()\n",
- " gdf['src'] = M.row\n",
- " gdf['dst'] = M.col\n",
- " \n",
- " print('\\tcuGraph Solving... ')\n",
- " \n",
- " t1 = time.time()\n",
- " \n",
- " # cugraph SSSP Call\n",
- " G = cugraph.Graph()\n",
- " G.from_cudf_edgelist(gdf, source='src', destination='dst')\n",
- " \n",
- " df = cugraph.sssp(G, 1)\n",
- " t2 = time.time() - t1\n",
- " \n",
- " return t2\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Basic NetworkX SSSP\n",
- "\n",
- "def networkx_call(M, max_iter, tol, alpha):\n",
- " nnz_per_row = {r: 0 for r in range(M.get_shape()[0])}\n",
- " for nnz in range(M.getnnz()):\n",
- " nnz_per_row[M.row[nnz]] = 1 + nnz_per_row[M.row[nnz]]\n",
- " for nnz in range(M.getnnz()):\n",
- " M.data[nnz] = 1.0/float(nnz_per_row[M.row[nnz]])\n",
- "\n",
- " M = M.tocsr()\n",
- " if M is None:\n",
- " raise TypeError('Could not read the input graph')\n",
- " if M.shape[0] != M.shape[1]:\n",
- " raise TypeError('Shape is not square')\n",
- "\n",
- " # should be autosorted, but check just to make sure\n",
- " if not M.has_sorted_indices:\n",
- " print('sort_indices ... ')\n",
- " M.sort_indices()\n",
- "\n",
- " z = {k: 1.0/M.shape[0] for k in range(M.shape[0])}\n",
- " \n",
- " print('\\tNetworkX Solving... ')\n",
- " \n",
- " # start timer\n",
- " t1 = time.time()\n",
- " \n",
- " Gnx = nx.DiGraph(M)\n",
- "\n",
- " pr = nx.shortest_path(Gnx, 1)\n",
- " \n",
- " t2 = time.time() - t1\n",
- "\n",
- " return t2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Run the benchmarks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# arrays to capture performance gains\n",
- "perf_nx = []\n",
- "names = []\n",
- "\n",
- "for k,v in data.items():\n",
- " gc.collect()\n",
- "\n",
- " rmm.reinitialize(\n",
- " managed_memory=False,\n",
- " pool_allocator=False,\n",
- " initial_pool_size=2 << 27\n",
- " ) \n",
- " \n",
- " # Save the file name\n",
- " names.append(k)\n",
- " \n",
- " # read the data\n",
- " M = read_mtx_file(v)\n",
- " \n",
- " # call cuGraph - this will be the baseline\n",
- " trapids = cugraph_call(M, 100, 0.00001, 0.85)\n",
- " \n",
- " # Now call NetworkX\n",
- " tn = networkx_call(M, 100, 0.00001, 0.85)\n",
- " speedUp = (tn / trapids)\n",
- " perf_nx.append(speedUp)\n",
- " \n",
- " print(\"\\tcuGraph (\" + str(trapids) + \") Nx (\" + str(tn) + \")\" )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "\n",
- "plt.figure(figsize=(10,8))\n",
- "\n",
- "bar_width = 0.4\n",
- "index = np.arange(len(names))\n",
- "\n",
- "_ = plt.bar(index, perf_nx, bar_width, color='g', label='vs Nx')\n",
- "\n",
- "plt.xlabel('Datasets')\n",
- "plt.ylabel('Speedup')\n",
- "plt.title('SSSP Performance Speedup of cuGraph vs NetworkX')\n",
- "plt.xticks(index, names)\n",
- "plt.xticks(rotation=90) \n",
- "\n",
- "# Text on the top of each barplot\n",
- "for i in range(len(perf_nx)):\n",
- " #plt.text(x = (i - 0.6) + bar_width, y = perf_nx[i] + 25, s = round(perf_nx[i], 1), size = 12)\n",
- " plt.text(x = i - (bar_width/2), y = perf_nx[i] + 25, s = round(perf_nx[i], 1), size = 12)\n",
- "\n",
- "#plt.legend()\n",
- "plt.show()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/intermediate_notebooks/benchmarks/cuml_benchmarks.ipynb b/intermediate_notebooks/benchmarks/cuml_benchmarks.ipynb
deleted file mode 100644
index 2f56d21d..00000000
--- a/intermediate_notebooks/benchmarks/cuml_benchmarks.ipynb
+++ /dev/null
@@ -1,488 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Benchmark and Bounds Tests\n",
- "\n",
- "The purpose of this notebook is to benchmark all of the single GPU cuML algorithms against their scikit-learn counterparts, while also providing the ability to find and verify upper bounds.\n",
- "\n",
- "Each benchmark returns a pandas DataFrame with the results, which can then be analyzed, manipulated, and stored to disk. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Notebook Credits\n",
- "**Authorship**\n",
- "Original Author: Corey Nolet \n",
- "Last Edit: Taurean Dyer, 9/25/2019 \n",
- "\n",
- "Last Edit: Corey Nolet, 10/04/2019\n",
- " \n",
- "### Test System Specs\n",
- "Test System Hardware: DGX-1 \n",
- "Test System Software: Ubuntu 16.04 \n",
- "RAPIDS Version: 0.10.0pre - Conda Install \n",
- "Driver: 410.48\n",
- "CUDA: 10.0 \n",
- "\n",
- "### Known Working Systems\n",
- "RAPIDS Versions: 0.10+"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cuml\n",
- "\n",
- "from cuml.benchmark.runners import SpeedupComparisonRunner\n",
- "from cuml.benchmark.algorithms import algorithm_by_name\n",
- "\n",
- "\n",
- "print(cuml.__version__)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Neighbors"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Nearest Neighbors"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"NearestNeighbors\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Clustering"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### DBSCAN"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"DBSCAN\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### K-means Clustering"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(12, 22)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"KMeans\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Manifold Learning"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### UMAP"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"UMAP\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### T-SNE"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"TSNE\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Linear Models"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Linear Regression"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"LinearRegression\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Logistic Regression"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"LogisticRegression\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Ridge Regression"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"Ridge\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Lasso Regression"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"Lasso\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### ElasticNet Regression"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"ElasticNet\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Mini-batch SGD Classifier"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"MBSGDClassifier\"))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Decomposition"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### PCA"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"PCA\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### TSVD"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"TSVD\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Ensemble"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Random Forest Classifier"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"RandomForestClassifier\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Random Forest Regressor"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(11, 24)], \n",
- " bench_dims=[64, 128, 256],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"RandomForestRegressor\"), verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Random Projection"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Gaussian Random Projection"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "runner = cuml.benchmark.runners.SpeedupComparisonRunner(\n",
- " bench_rows=[2**x for x in range(17, 24)], \n",
- " bench_dims=[100, 500, 1000, 10000],\n",
- " dataset_name=\"blobs\",\n",
- " input_type=\"numpy\")\n",
- "\n",
- "results = runner.run(algorithm_by_name(\"GaussianRandomProjection\"), verbose=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python (cuml_dev)",
- "language": "python",
- "name": "other-env"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
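The benchmark cells in the notebook above all follow one pattern: build a grid of row counts and feature counts, time a baseline and an accelerated implementation on each grid point, and report the ratio. The block below is a minimal CPU-only sketch of that pattern, not cuML's `SpeedupComparisonRunner` itself. Every name in it (`best_time`, `compare_speed`, `make_rows`, `loop_total`, `builtin_total`) is hypothetical and chosen for illustration, and the two "algorithms" are plain Python summation routines standing in for a CPU and a GPU estimator.

```python
import time


def best_time(fn, data, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def compare_speed(baseline_fn, candidate_fn, make_data, bench_rows, bench_dims):
    """Time both functions on every (n_rows, n_dims) pair and record the speedup."""
    results = []
    for n_rows in bench_rows:
        for n_dims in bench_dims:
            data = make_data(n_rows, n_dims)
            base_t = best_time(baseline_fn, data)
            cand_t = best_time(candidate_fn, data)
            results.append({
                "n_rows": n_rows,
                "n_dims": n_dims,
                "baseline_time": base_t,
                "candidate_time": cand_t,
                # guard against a zero reading from the timer
                "speedup": base_t / cand_t if cand_t > 0 else float("inf"),
            })
    return results


def make_rows(n_rows, n_dims):
    """Hypothetical data generator: a list of n_rows rows with n_dims floats each."""
    return [[float(i + j) for j in range(n_dims)] for i in range(n_rows)]


def loop_total(rows):
    """Stand-in 'baseline': sum every value with explicit Python loops."""
    total = 0.0
    for row in rows:
        for value in row:
            total += value
    return total


def builtin_total(rows):
    """Stand-in 'candidate': the same sum using the built-in sum()."""
    return sum(sum(row) for row in rows)


results = compare_speed(
    loop_total,
    builtin_total,
    make_rows,
    bench_rows=[2**x for x in range(8, 11)],
    bench_dims=[4, 8],
)
for r in results:
    print(f"{r['n_rows']:>5} x {r['n_dims']:<2} speedup: {r['speedup']:.2f}x")
```

Taking the best of several repeats keeps one-off noise (such as a cold cache) out of the ratio. The measured speedups depend on the machine, so no expected output is shown.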
diff --git a/intermediate_notebooks/examples/linear_regression_demo.ipynb b/intermediate_notebooks/examples/linear_regression_demo.ipynb
deleted file mode 100644
index 1e6b1d98..00000000
--- a/intermediate_notebooks/examples/linear_regression_demo.ipynb
+++ /dev/null
@@ -1,826 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "2tZ3RLnlkrkg"
- },
- "source": [
- "# Intro to Linear Regression with cuML\n",
- "Corresponding notebook to [*Beginner’s Guide to Linear Regression in Python with cuML*](http://bit.ly/cuml_lin_reg_friend) story on Medium\n",
- "\n",
- "Linear Regression is a simple machine learning model where the response `y` is modelled by a linear combination of the predictors in `X`. The `LinearRegression` class implemented in the `cuML` library allows users to change the `fit_intercept`, `normalize`, and `algorithm` parameters.\n",
- "\n",
- "Here is a brief on RAPIDS' Linear Regression parameters:\n",
- "\n",
- "- `algorithm`: 'eig' or 'svd' (default = 'eig')\n",
- " - `Eig` uses an eigendecomposition of the covariance matrix, and is much faster\n",
- " - `SVD` is slower, but guaranteed to be stable\n",
- "- `fit_intercept`: boolean (default = True)\n",
- " - If `True`, `LinearRegression` tries to correct for the global mean of `y`\n",
- " - If `False`, the model expects that you have centered the data.\n",
- "- `normalize`: boolean (default = False)\n",
- " - If `True`, the predictors in `X` will be normalized by dividing by their L2 norm\n",
- " - If `False`, no scaling will be done\n",
- "\n",
- "Methods that can be used with `LinearRegression` are:\n",
- "\n",
- "- `fit`: Fit the model with `X` and `y`\n",
- "- `get_params`: Sklearn style return parameter state\n",
- "- `predict`: Predicts the `y` for `X`\n",
- "- `set_params`: Sklearn style set parameter state to dictionary of params\n",
- "\n",
- "`cuML`'s `LinearRegression` expects either `cuDF` DataFrame or `NumPy` matrix inputs\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "-tG6ezqKh1Z0"
- },
- "source": [
- "Note: `CuPy` is not installed by default with RAPIDS `Conda` or `Docker` packages, but is needed for visualizing results in this notebook.\n",
- "- install with `pip` via the cell below "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "pxBcXor_0-Jd"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Requirement already satisfied: cupy in /opt/conda/envs/rapids/lib/python3.6/site-packages (7.4.0)\n",
- "Requirement already satisfied: six>=1.9.0 in /opt/conda/envs/rapids/lib/python3.6/site-packages (from cupy) (1.14.0)\n",
- "Requirement already satisfied: numpy>=1.9.0 in /opt/conda/envs/rapids/lib/python3.6/site-packages (from cupy) (1.18.4)\n",
- "Requirement already satisfied: fastrlock>=0.3 in /opt/conda/envs/rapids/lib/python3.6/site-packages (from cupy) (0.4)\n"
- ]
- }
- ],
- "source": [
- "# install cupy\n",
- "!pip install cupy"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "N20le3_KlP3O"
- },
- "source": [
- "## Load data\n",
- "- for this demo, we will be utilizing the Boston housing dataset from `sklearn`\n",
- " - start by loading in the set and printing a map of the contents"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 34
- },
- "colab_type": "code",
- "id": "RFE-nxxlTajg",
- "outputId": "04f89e88-61a3-4dd2-9088-123b410e508c"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])\n"
- ]
- }
- ],
- "source": [
- "from sklearn.datasets import load_boston\n",
- "\n",
- "# load Boston dataset\n",
- "boston = load_boston()\n",
- "\n",
- "# let's see what's inside\n",
- "print(boston.keys())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wmcO8dxO0uOB"
- },
- "source": [
- "#### Boston house prices dataset\n",
- "- a description of the dataset is provided in `DESCR`\n",
- " - let's explore "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 923
- },
- "colab_type": "code",
- "id": "c3kLHAsP-Al2",
- "outputId": "02518c3c-7767-42a7-b6f4-6756ace741cc"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- ".. _boston_dataset:\n",
- "\n",
- "Boston house prices dataset\n",
- "---------------------------\n",
- "\n",
- "**Data Set Characteristics:** \n",
- "\n",
- " :Number of Instances: 506 \n",
- "\n",
- " :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n",
- "\n",
- " :Attribute Information (in order):\n",
- " - CRIM per capita crime rate by town\n",
- " - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n",
- " - INDUS proportion of non-retail business acres per town\n",
- " - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
- " - NOX nitric oxides concentration (parts per 10 million)\n",
- " - RM average number of rooms per dwelling\n",
- " - AGE proportion of owner-occupied units built prior to 1940\n",
- " - DIS weighted distances to five Boston employment centres\n",
- " - RAD index of accessibility to radial highways\n",
- " - TAX full-value property-tax rate per $10,000\n",
- " - PTRATIO pupil-teacher ratio by town\n",
- " - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
- " - LSTAT % lower status of the population\n",
- " - MEDV Median value of owner-occupied homes in $1000's\n",
- "\n",
- " :Missing Attribute Values: None\n",
- "\n",
- " :Creator: Harrison, D. and Rubinfeld, D.L.\n",
- "\n",
- "This is a copy of UCI ML housing dataset.\n",
- "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n",
- "\n",
- "\n",
- "This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n",
- "\n",
- "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
- "prices and the demand for clean air', J. Environ. Economics & Management,\n",
- "vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
- "...', Wiley, 1980. N.B. Various transformations are used in the table on\n",
- "pages 244-261 of the latter.\n",
- "\n",
- "The Boston house-price data has been used in many machine learning papers that address regression\n",
- "problems. \n",
- " \n",
- ".. topic:: References\n",
- "\n",
- " - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
- " - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# what do we know about this dataset?\n",
- "print(boston.DESCR)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wI_sB78vE297"
- },
- "source": [
- "### Build Dataframe\n",
- "- Import `cuDF` and input the data into a DataFrame \n",
- " - Then add a `PRICE` column equal to the `target` key"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 206
- },
- "colab_type": "code",
- "id": "xiMmIZ8O5scJ",
- "outputId": "fd09db1f-fb41-4494-bb8b-eab6e18c258f"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr><th></th><th>CRIM</th><th>ZN</th><th>INDUS</th><th>CHAS</th><th>NOX</th><th>RM</th><th>AGE</th><th>DIS</th><th>RAD</th><th>TAX</th><th>PTRATIO</th><th>B</th><th>LSTAT</th><th>PRICE</th></tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.0900</td><td>1.0</td><td>296.0</td><td>15.3</td><td>396.90</td><td>4.98</td><td>24.0</td></tr>\n",
- " <tr><th>1</th><td>0.02731</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>6.421</td><td>78.9</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>396.90</td><td>9.14</td><td>21.6</td></tr>\n",
- " <tr><th>2</th><td>0.02729</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>7.185</td><td>61.1</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>392.83</td><td>4.03</td><td>34.7</td></tr>\n",
- " <tr><th>3</th><td>0.03237</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.998</td><td>45.8</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>394.63</td><td>2.94</td><td>33.4</td></tr>\n",
- " <tr><th>4</th><td>0.06905</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>7.147</td><td>54.2</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>396.90</td><td>5.33</td><td>36.2</td></tr>\n",
- " </tbody>\n",
- "</table>"
- ],
- "text/plain": [
- " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n",
- "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n",
- "1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n",
- "2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n",
- "3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n",
- "4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n",
- "\n",
- " PTRATIO B LSTAT PRICE \n",
- "0 15.3 396.90 4.98 24.0 \n",
- "1 17.8 396.90 9.14 21.6 \n",
- "2 17.8 392.83 4.03 34.7 \n",
- "3 18.7 394.63 2.94 33.4 \n",
- "4 18.7 396.90 5.33 36.2 "
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import cudf\n",
- "\n",
- "# build dataframe from data key\n",
- "bos = cudf.DataFrame(list(boston.data))\n",
- "# set column names to feature_names\n",
- "bos.columns = boston.feature_names\n",
- "\n",
- "# add PRICE column from target\n",
- "bos['PRICE'] = boston.target\n",
- "\n",
- "# let's see what we're working with\n",
- "bos.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "r2qrTxo4ljZp"
- },
- "source": [
- "### Split Train from Test\n",
- "- For basic Linear Regression, we will predict `PRICE` (Median value of owner-occupied homes) based on `TAX` (full-value property-tax rate per $10,000)\n",
- " - Go ahead and trim data to just these columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "spaDB10E3okF"
- },
- "outputs": [],
- "source": [
- "# simple linear regression X and Y\n",
- "X = bos['TAX']\n",
- "Y = bos['PRICE']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "4TKLv8FjIBuI"
- },
- "source": [
- "We can now set training and testing sets for our model\n",
- "- Use `cuML`'s `train_test_split` to do this\n",
- " - Train on 70% of data\n",
- " - Test on 30% of data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 86
- },
- "colab_type": "code",
- "id": "1DC6FHsNIKH_",
- "outputId": "4c932268-7a82-4ac3-c7b9-9966ffc2b12e"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(354,)\n",
- "(152,)\n",
- "(354,)\n",
- "(152,)\n"
- ]
- }
- ],
- "source": [
- "from cuml.preprocessing.model_selection import train_test_split\n",
- "\n",
- "# train/test split (70:30)\n",
- "sX_train, sX_test, sY_train, sY_test = train_test_split(X, Y, train_size = 0.7)\n",
- "\n",
- "# see what it looks like\n",
- "print(sX_train.shape)\n",
- "print(sX_test.shape)\n",
- "print(sY_train.shape)\n",
- "print(sY_test.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ZLVg44gAmJG7"
- },
- "source": [
- "### Predict Values\n",
- "1. fit the model with `TAX` (`sX_train`) and corresponding `PRICE` (`sY_train`) values \n",
- " - so it can build an understanding of their relationship \n",
- "2. predict `PRICE` for a test set of `TAX` (`sX_test`) values\n",
- " - and compare `PRICE` predictions to actual median house (`sY_test`) values\n",
- " - use `sklearn`'s `mean_squared_error` to do this"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 666.0\n",
- "1 403.0\n",
- "2 193.0\n",
- "3 307.0\n",
- "4 264.0\n",
- "Name: TAX, dtype: float64"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "sX_train.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 34
- },
- "colab_type": "code",
- "id": "ZGMPloJxGtK3",
- "outputId": "664b54fe-16d5-4140-a657-3dc782574da9"
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/opt/conda/envs/rapids/lib/python3.6/site-packages/ipykernel_launcher.py:8: UserWarning: Changing solver from 'eig' to 'svd' as eig solver does not support training data with 1 column currently.\n",
- " \n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "53.207501007491125\n"
- ]
- }
- ],
- "source": [
- "from cuml import LinearRegression\n",
- "from sklearn.metrics import mean_squared_error\n",
- "\n",
- "# call Linear Regression model\n",
- "slr = LinearRegression()\n",
- "\n",
- "# train the model\n",
- "slr.fit(sX_train, sY_train)\n",
- "\n",
- "# make predictions for test X values\n",
- "sY_pred = slr.predict(sX_test)\n",
- "\n",
- "# calculate error\n",
- "mse = mean_squared_error(sY_test.to_array(), \n",
- " sY_pred.to_array())\n",
- "print(mse)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "T7BXjkPSGwqd"
- },
- "source": [
- "3. visualize prediction accuracy with `matplotlib`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 305
- },
- "colab_type": "code",
- "id": "pp9RNPt_Iemk",
- "outputId": "22a22472-50ad-4bb3-d104-35e9e100b8b6"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "import cupy\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "# scatter actual and predicted results\n",
- "plt.scatter(sY_test.to_array(), sY_pred.to_array())\n",
- "\n",
- "# label graph\n",
- "plt.xlabel(\"Actual Prices: $Y_i$\")\n",
- "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n",
- "plt.title(\"Prices vs Predicted prices: $Y_i$ vs $\\hat{Y}_i$\")\n",
- "\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8MqX73B4s5tv"
- },
- "source": [
- "## Multiple Linear Regression \n",
- "- Our mean squared error for Simple Linear Regression looks kinda high.\n",
- " - Let's try Multiple Linear Regression (predicting based on multiple variables rather than just `TAX`) and see if that produces more accurate predictions\n",
- "\n",
- "1. Set X to contain all values that are not `PRICE` from the unsplit data\n",
- " - i.e. `CRIM`, `ZN`, `INDUS`, `CHAS`, `NOX`, `RM`, `AGE`, `DIS`, `RAD`, `TAX`, `PTRATIO`, `B`, `LSTAT`\n",
- " - Y to still represent just 1 target value (`PRICE`)\n",
- " - also from the unsplit data\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "ZtQK5-f4M0Vg"
- },
- "outputs": [],
- "source": [
- "# set X to all variables except price\n",
- "mX = bos.drop('PRICE', axis=1)\n",
- "# and, like in the simple Linear Regression, set Y to price\n",
- "mY = bos['PRICE']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RTYG4-UwNDsK"
- },
- "source": [
- "2. Split the data into `mX_train`, `mX_test`, `mY_train`, and `mY_test`\n",
- " - Use `cuML`'s `train_test_split`\n",
- " - And the same 70:30 train:test ratio"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 86
- },
- "colab_type": "code",
- "id": "EsKxK8u_F7t8",
- "outputId": "673a1a44-4d2f-4a45-8333-8f29782eaf65"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "(354, 13)\n",
- "(152, 13)\n",
- "(354,)\n",
- "(152,)\n"
- ]
- }
- ],
- "source": [
- "# train/test split (70:30)\n",
- "mX_train, mX_test, mY_train, mY_test = train_test_split(mX, mY, train_size = 0.7)\n",
- "\n",
- "# see what it looks like\n",
- "print(mX_train.shape)\n",
- "print(mX_test.shape)\n",
- "print(mY_train.shape)\n",
- "print(mY_test.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "_Y40R17LGHsI"
- },
- "source": [
- "3. fit the model with `mX_train` and corresponding `PRICE` (`mY_train`) values \n",
- " - so it can build an understanding of their relationships \n",
- "4. predict `PRICE` for the test set of independent (`mX_test`) values\n",
- " - and compare `PRICE` predictions to actual median house (`mY_test`) values\n",
- " - use `sklearn`'s `mean_squared_error` to do this"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 34
- },
- "colab_type": "code",
- "id": "N7qm1HuVO-1k",
- "outputId": "7e291cec-e602-4ad9-a5b3-b70d7261f63d"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "28.312087834147203\n"
- ]
- }
- ],
- "source": [
- "# call Linear Regression model\n",
- "mlr = LinearRegression()\n",
- "\n",
- "# train the model for multiple regression\n",
- "mlr.fit(mX_train, mY_train)\n",
- "\n",
- "# make predictions for test X values\n",
- "mY_pred = mlr.predict(mX_test)\n",
- "\n",
- "# calculate error\n",
- "mmse = mean_squared_error(mY_test.to_array(), mY_pred.to_array())\n",
- "print(mmse)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "jTdmleXCM_Xb"
- },
- "source": [
- "5. visualize with `matplotlib`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 305
- },
- "colab_type": "code",
- "id": "Q83NFMK1JKvL",
- "outputId": "569cfa77-a66e-4b1b-9d70-ae4ef8e7936e"
- },
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# scatter actual and predicted results\n",
- "plt.scatter(mY_test.to_array(), mY_pred.to_array())\n",
- "\n",
- "# label graph\n",
- "plt.xlabel(\"Actual Prices: $Y_i$\")\n",
- "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n",
- "plt.title(\"Prices vs Predicted prices: $Y_i$ vs $\\hat{Y}_i$\")\n",
- "\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "2X1RA6sgtZQ6"
- },
- "source": [
- "## Conclusion\n",
- "- looks like the multiple regression we ran does provide more accurate predictions than the simple linear regression\n",
- " - this will not always be the case, so always be sure to check and confirm if the extra computing is worth it\n",
- "\n",
- "Anyways, that's how you implement both Simple and Multiple Linear Regression with `cuML`. Go forth and do great things. Thanks for stopping by!"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "collapsed_sections": [],
- "name": "LOCAL_intro_lin_reg_cuml",
- "provenance": []
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
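The simple-regression flow in the notebook above (split 70:30, fit, predict, score with mean squared error) can be sketched without a GPU. The block below is a pure-Python illustration under stated assumptions, not cuML's implementation: it uses the textbook closed-form least-squares solution for one predictor, runs on synthetic data rather than the Boston dataset, and all function names (`split_train_test`, `fit_simple_ols`, `predict`, `mse`) are hypothetical.

```python
import random


def split_train_test(xs, ys, train_size=0.7, seed=0):
    """Shuffle indices and split both sequences into train and test parts."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_size)
    train, test = idx[:cut], idx[cut:]
    return (
        [xs[i] for i in train],
        [xs[i] for i in test],
        [ys[i] for i in train],
        [ys[i] for i in test],
    )


def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(xs, slope, intercept):
    """Apply the fitted line to new x values."""
    return [slope * x + intercept for x in xs]


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data (an assumption for illustration): y = 3x + 5 plus unit Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(506)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

x_train, x_test, y_train, y_test = split_train_test(xs, ys, train_size=0.7)
slope, intercept = fit_simple_ols(x_train, y_train)
test_error = mse(y_test, predict(x_test, slope, intercept))

print(len(x_train), len(x_test))
print(slope, intercept, test_error)
```

With 506 rows, the 70:30 split gives the same 354/152 shapes the notebook prints. Because the synthetic noise has unit variance, a test error near 1 means the fit has recovered the underlying line. An error far above the noise level, like the notebook's single-feature result, suggests the one predictor does not explain much of the target.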
diff --git a/intermediate_notebooks/examples/weather.ipynb b/intermediate_notebooks/examples/weather.ipynb
deleted file mode 100644
index 604628ce..00000000
--- a/intermediate_notebooks/examples/weather.ipynb
+++ /dev/null
@@ -1,701 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Simpler Multi-GPU ETL using Dask ##\n",
- "\n",
- "A major focus of the last several RAPIDS releases is easier scaling: up *and* out.\n",
- "\n",
- "While we introduced examples of multi-GPU/multi-node data processing using Dask in our first release, it was difficult to install, configure, and launch.\n",
- "\n",
- "Running our main example, the [Mortgage Workflow](https://github.com/rapidsai/notebooks-contrib/blob/master/intermediate_notebooks/E2E/mortgage/mortgage_e2e.ipynb) required:\n",
- "\n",
- "1. Pre-splitting or downloading pre-split datasets\n",
- "2. Using a [custom shell script](https://github.com/rapidsai/notebooks/blob/master/utils/dask-setup.sh) to:\n",
- " * Check for and force shut-down of existing dask clusters\n",
- " * Set environment variables\n",
- " * Launch dask-scheduler and dask-worker processes\n",
- "3. Making limited use of Dask, only via the [`delayed` interface](http://docs.dask.org/en/latest/delayed.html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Since our first release, we've created the [dask-cuda project](https://github.com/rapidsai/dask-cuda), which automatically handles configuring Dask worker processes to make use of available GPUs.\n",
- "\n",
- "We also improved [dask-cudf](https://github.com/rapidsai/cudf/tree/branch-0.10/python/dask_cudf) to support a variety of common ETL operations. While joins and groupbys received the most attention, dask-cudf now also supports friendlier parallel IO.\n",
- "\n",
- "The rest of this notebook demonstrates how we've addressed the above pains, and generally made scaling RAPIDS out to multiple GPUs easier.\n",
- "\n",
- "First, let's see what GPUs we have available..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from dask.distributed import Client, wait\n",
- "from dask_cuda import LocalCUDACluster\n",
- "import dask, dask_cudf\n",
- "from dask.diagnostics import ProgressBar\n",
- "\n",
- "# Use dask-cuda to start one worker per GPU on a single-node system\n",
- "# When you shutdown this notebook kernel, the Dask cluster also shuts down.\n",
- "cluster = LocalCUDACluster(ip='0.0.0.0')\n",
- "client = Client(cluster)\n",
- "# print client info\n",
- "client"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Ok, we've got a cluster of GPU workers. Notice also the link to the Dask status dashboard. It provides lots of useful information while running data processing tasks.\n",
- "\n",
- "## Accessing Data\n",
- "\n",
- "Now, let's download a dataset.\n",
- "\n",
- "If you're working on a local machine, you'd normally use wget, Python's `urllib` package, or another tool to pull down the data you want to analyze.\n",
- "\n",
- "For the sake of not making you wait for 200+ files to download, the cell below uses `urllib` to download just 20 years of weather records, and a metadata file about the stations that recorded them. You can update the `years` list if you want to download more; it won't change the logic in the notebook, it'll just process more data.\n",
- "\n",
- "*Note*: The rest of the markdown commentary in this notebook assumes you're operating on all 232 years of data."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Make and set a home for your data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import urllib.request\n",
- "\n",
- "data_dir = '../../data/weather/'\n",
- "if not os.path.exists(data_dir):\n",
- " print('creating weather directory')\n",
- " os.makedirs(data_dir)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Choose and Download your data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# download weather observations\n",
- "base_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/'\n",
- "years = list(range(2000, 2020))\n",
- "for year in years:\n",
- " fn = str(year) + '.csv.gz'\n",
- " if not os.path.isfile(data_dir+fn):\n",
- " print(f'Downloading {base_url+fn} to {data_dir+fn}')\n",
- " urllib.request.urlretrieve(base_url+fn, data_dir+fn)\n",
- " \n",
- "# download weather station metadata\n",
- "station_meta_url = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'\n",
- "if not os.path.isfile(data_dir+'ghcnd-stations.txt'):\n",
- " print('Downloading station meta..')\n",
- " urllib.request.urlretrieve(station_meta_url, data_dir+'ghcnd-stations.txt')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Alternatives to Pre-Downloading Data\n",
- "\n",
- "While downloading or copying data to your local environment is a good way to get started, many users will want other options:\n",
- "\n",
- "1. Reading directly from distributed storage, like HDFS\n",
- "2. Reading from cloud storage (S3, GCS, ADLS, etc)\n",
- "\n",
- "See [Dask Remote Data Services](http://docs.dask.org/en/latest/remote-data-services.html) for more details on supported providers, authentication, and other storage configuration options.\n",
- "\n",
- "Here's an example of reading the same weather data, conveniently available in a public Amazon S3 bucket.\n",
- "\n",
- "But first make sure your Python environment has the right packages to read from your storage system of choice.\n",
- "\n",
- "For this example: ```conda install -y s3fs```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# these CSV files don't have headers, we specify column names manually\n",
- "names = [\"station_id\", \"date\", \"type\", \"val\"]\n",
- "# there are more fields, but only the first 4 are relevant in this notebook\n",
- "usecols = names[0:4]\n",
- "\n",
- "url = 's3://noaa-ghcn-pds/csv/1788.csv'\n",
- "dask_cudf.read_csv(url, names=names, usecols=usecols, storage_options={'anon': True})"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Reading Large & Multi-File DataSets\n",
- "\n",
- "Wait... there are many weather files: one for each year going back to the 1780s.\n",
- "\n",
- "Before RAPIDS 0.6, if you wanted to read all these files in, you'd need to either use a for-loop and manually concatenate DataFrames, or use [`dask.delayed`](http://docs.dask.org/en/latest/delayed.html) functions that invoke `cudf.read_csv`.\n",
- "\n",
- "Fortunately, now there's `dask_cudf.read_csv`, which supports file globs, _and_ automatically splits files into chunks that can be processed serially when needed, so you're less likely to run out of memory.\n",
- "\n",
- "When you call `dask_cudf.read_csv`, Dask reads metadata for each CSV file and tasks workers with lists of filenames & byte-ranges that they're responsible for loading with cuDF's GPU CSV reader.\n",
- "\n",
- "*Note*: compressed files are not splittable on read, but you can [repartition](https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead) them downstream."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "weather_ddf = dask_cudf.read_csv(data_dir+'*.csv.gz', names=names, usecols=usecols, compression='gzip')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Let's Process Some Data\n",
- "\n",
- "Per the [readme](https://docs.opendata.aws/noaa-ghcn-pds/readme.html) for this dataset, multiple types of weather observations are in the same files, and each uses different units of measure:\n",
- "\n",
- "| Observation Type | Existing Units | Action |\n",
- "| ------------- | ------------- | ------------- |\n",
- "| PRCP | Precipitation (tenths of mm) | convert to inches |\n",
- "| SNWD | Snow depth (mm) | convert to inches |\n",
- "| TMAX | tenths of degrees C | convert to Fahrenheit |\n",
- "| TMIN | tenths of degrees C | convert to Fahrenheit |\n",
- "\n",
- "There are even more observation types, each with its own units of measure, but I won't list them all. In this notebook, I'm going to focus specifically on precipitation.\n",
- "\n",
- "The `type` column tells us what kind of weather observation each record represents. Ordinarily, you might use `query` to filter out subsets of records and apply different logic to each subset. However, [query doesn't support string datatypes yet](https://github.com/rapidsai/cudf/issues/111). Instead, you can use boolean indexing.\n",
- "\n",
- "For numeric types, Dask with cuDF works mostly like regular Dask. For instance, you can define new columns as combinations of other columns:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "precip_index = weather_ddf['type'] == 'PRCP'\n",
- "precip_ddf = weather_ddf[precip_index]\n",
- "\n",
- "# convert 10ths of mm to inches\n",
- "mm_to_inches = 0.0393701\n",
- "precip_ddf['val'] = precip_ddf['val'] * 1/10 * mm_to_inches"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note: Calling `.head()` will read the first few rows, usually from the first partition.\n",
- "\n",
- "In our case, the first partition represents weather data from 1788. Apparently, there wasn't _any_ precipitation data collected that year, so filtering leaves that partition empty.\n",
- "\n",
- "In your own analyses, make sure you call `.head()` on a partition that still has rows left after filtering!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "precip_ddf.get_partition(1).head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Ok, we have a lot of weather observations. Now what?\n",
- "\n",
- "# Answering Questions With Data ##\n",
- "\n",
- "For some reason, residents of particular cities like to lay claim to having the best, or the worst of something. For Los Angeles, it's having the worst traffic. New Yorkers and Chicagoans argue over who has the best pizza. [West Coasters argue about who has the most rain](https://twitter.com/MikeNiccoABC7/status/1105184947663396864).\n",
- "\n",
- "Well... as a longtime Atlanta resident suffering from humidity exhaustion, I like to joke that with all the spring showers, _Atlanta_ is the new Seattle.\n",
- "\n",
- "Does my theory hold water? Or will the data rain on my bad pun parade?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# How Can I Test My Theory?\n",
- "\n",
- "We've already created `precip_ddf`, which contains only the precipitation observations, but it covers all 100k weather stations, most of them nowhere near Atlanta. This is also time-series data, so we'll need to aggregate over time ranges.\n",
- "\n",
- "To get down to just Atlanta and Seattle precipitation records, we have to...\n",
- "\n",
- "1. Extract year, month, and day from the compound \"date\" column, so that we can compare total rainfall across time.\n",
- "\n",
- "2. Load up the station metadata file.\n",
- "\n",
- "3. There's no \"city\" in the station metadata, so we'll do some geo-math and keep only stations near Atlanta and Seattle.\n",
- "\n",
- "4. Use a groupby to compare changing precipitation patterns across time.\n",
- "\n",
- "5. Use inner joins to filter the precipitation dataframe down to just Atlanta & Seattle data."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 1. Extracting Finer Grained Date Fields\n",
- "\n",
- "We _can_ do a bit of math to separate date parts:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "precip_ddf['year'] = precip_ddf['date']/10000\n",
- "precip_ddf['year'] = precip_ddf['year'].astype('int')\n",
- "\n",
- "precip_ddf['month'] = (precip_ddf['date'] - precip_ddf['year']*10000)/100\n",
- "precip_ddf['month'] = precip_ddf['month'].astype('int')\n",
- "\n",
- "precip_ddf['day'] = (precip_ddf['date'] - precip_ddf['year']*10000 - precip_ddf['month']*100)\n",
- "precip_ddf['day'] = precip_ddf['day'].astype('int')\n",
- "\n",
- "precip_ddf.get_partition(1).head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For this dataset, getting date parts is easier with string slicing. However, as is sometimes the case, Dask expects some aspect of cuDF's Python API to match Pandas in a way that [isn't fully compatible yet](https://github.com/rapidsai/cudf/issues/2367).\n",
- "\n",
- "That bug will likely be resolved quickly, but this example is a good chance to show how to work around similar problems.\n",
- "\n",
- "Dask has a [map_partitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.map_partitions) function which will apply a given Python function to all partitions of a distributed DataFrame. When you do this on a dask_cudf df, your input is a cuDF object:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_date_parts(df):\n",
- " date_str = df['date'].astype('str')\n",
- " df['year'] = date_str.str.slice(0, 4).astype('int')\n",
- " df['month'] = date_str.str.slice(4, 6).astype('int')\n",
- " df['day'] = date_str.str.slice(6, 8).astype('int')\n",
- " return df\n",
- "\n",
- "# any single-GPU function that works in cuDF may be called via dask.map_partitions\n",
- "precip_ddf = precip_ddf.map_partitions(get_date_parts)\n",
- "precip_ddf.get_partition(1).head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `map_partitions` pattern is also useful whenever there are cuDF-specific functions without a direct mapping into Dask."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 2. Loading Station Metadata ##"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!head -n 5 /data/weather/ghcnd-stations.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Wait... That's no CSV file! It's fixed-width!\n",
- "\n",
- "That's annoying because we don't have a reader for it. We could use CPU code to pre-process the file, making it friendlier for loading into a DataFrame, but RAPIDS is about end-to-end data processing without leaving the GPU.\n",
- "\n",
- "This file is small enough that we can handle it directly with cuDF on a single GPU.\n",
- "\n",
- "*Warning*: Make sure you [create your dask-cuda cluster _before_ importing cudf](https://github.com/rapidsai/dask-cuda/issues/32).\n",
- "\n",
- "Here's how to clean up this metadata using cuDF and string operations:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cudf\n",
- "\n",
- "fn = data_dir+'ghcnd-stations.txt'\n",
- "# There are no '|' chars in the file. Use that to read the file as a single column per line\n",
- "# quoting=3 handles misplaced quotes in the `name` field \n",
- "station_df = cudf.read_csv(fn, sep='|', quoting=3, names=['lines'], header=None)\n",
- "\n",
- "# you can use normal DataFrame .str accessor, and chain operators together\n",
- "station_df['station_id'] = station_df['lines'].str.slice(0, 11).str.strip()\n",
- "station_df['latitude'] = station_df['lines'].str.slice(12, 20).str.strip()\n",
- "station_df['longitude'] = station_df['lines'].str.slice(21, 30).str.strip()\n",
- "station_df = station_df.drop('lines')\n",
- "\n",
- "station_df.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Managing Memory\n",
- "\n",
- "While GPU memory is very fast, there's less of it than host RAM. It's a good idea to avoid storing lots of columns that aren't useful for what you're trying to do, especially when they're strings.\n",
- "\n",
- "For example, for the station metadata, there are more columns than we parsed out above. In this workflow we only need `station_id`, `latitude`, and `longitude`, so we skipped parsing the rest of the columns.\n",
- "\n",
- "We also need to convert latitude and longitude from strings to floats, and convert the single-GPU DataFrame to a Dask DataFrame that can be distributed across workers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# you can cast string columns to numerics\n",
- "station_df['latitude'] = station_df['latitude'].astype('float')\n",
- "station_df['longitude'] = station_df['longitude'].astype('float')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 3. Filtering Weather Stations by Distance\n",
- "\n",
- "Initially we planned to use our [existing Haversine Distance user defined function](https://medium.com/rapids-ai/user-defined-functions-in-rapids-cudf-2d7c3fc2728d) to figure out which stations are within a given distance from a city. However, that relies on a [numba CUDA JIT'ed kernel](https://numba.pydata.org/numba-doc/dev/cuda/index.html), which would be slower and would incur compilation time the first time you call it.\n",
- "\n",
- "Now that [cuSpatial](https://github.com/rapidsai/cuspatial) is available as [a nightly conda package](https://anaconda.org/rapidsai-nightly/cuspatial), we can use it without having to build from source:\n",
- "\n",
- "```\n",
- "conda install -c conda-forge -c rapidsai-nightly cuspatial\n",
- "```\n",
- "\n",
- "For this scenario, we've manually looked up Atlanta and Seattle's city centers and will fill `cudf.Series` with their latitude and longitude values. Then we can call a cuSpatial function to compute the distance between each station and each city."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cuspatial\n",
- "\n",
- "# fill new Series with Atlanta lat/lng\n",
- "station_df['atlanta_lat'] = 33.7490\n",
- "station_df['atlanta_lng'] = -84.3880\n",
- "# compute distance from each station to Atlanta\n",
- "station_df['atlanta_dist'] = cuspatial.haversine_distance(\n",
- " station_df['longitude'], station_df['latitude'],\n",
- " station_df['atlanta_lng'], station_df['atlanta_lat']\n",
- ")\n",
- "\n",
- "# fill new Series with Seattle lat/lng\n",
- "station_df['seattle_lat'] = 47.6219\n",
- "station_df['seattle_lng'] = -122.3517\n",
- "# compute distance from each station to Seattle\n",
- "station_df['seattle_dist'] = cuspatial.haversine_distance(\n",
- " station_df['longitude'], station_df['latitude'],\n",
- " station_df['seattle_lng'], station_df['seattle_lat']\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Checking the Results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Inspect the results:\n",
- "atlanta_stations_df = station_df.query('atlanta_dist <= 25')\n",
- "seattle_stations_df = station_df.query('seattle_dist <= 25')\n",
- "\n",
- "print(f'Atlanta Stations: {len(atlanta_stations_df)}')\n",
- "print(f'Seattle Stations: {len(seattle_stations_df)}')\n",
- "\n",
- "atlanta_stations_df.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "[Google tells me those station ids are from Smyrna](https://geographic.org/global_weather/georgia/smyrna_23_ne_002.html), a town just outside of Atlanta's perimeter. Our distance calculation worked!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 4. Grouping & Aggregating by Time Range\n",
- "\n",
- "Before using an inner join to filter down to city-specific precipitation data, we can use a groupby to sum the precipitation for each station and year. That'll allow the join to proceed faster and use less memory.\n",
- "\n",
- "One total precipitation record per station per year is relatively small, and we're going to need to graph this data, so we'll go ahead and `compute()` the result, asking Dask to aggregate across the 200+ years' worth of data, bringing the results back to the client as a single-GPU cuDF DataFrame.\n",
- "\n",
- "Note that with Dask, data is partitioned and distributed across multiple workers. Some operations require that workers \"[shuffle](http://docs.dask.org/en/latest/dataframe-groupby.html#)\" data from their partitions back and forth across the network, which has major performance implications. Today, join, groupby, and sort operations can be fairly network-constrained.\n",
- "\n",
- "See the [slides](https://www.slideshare.net/MatthewRocklin/ucxpython-a-flexible-communication-library-for-python-applications) from a recent talk at GTC San Jose to learn more about [ongoing efforts to integrate Dask with UCX](https://github.com/rapidsai/ucx-py/) and allow it to use accelerated networking hardware like Infiniband and [nvlink](https://www.nvidia.com/en-us/data-center/nvlink/).\n",
- "\n",
- "In the meantime, distributed operators that require shuffling like joins, groupbys, and sorts work, albeit not as fast as we'd like."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "precip_year_ddf = precip_ddf.groupby(by=['station_id', 'year']).val.sum()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that we're calling `compute` again here. This tells Dask to actually start computing the full set of processing logic defined thus far:\n",
- "\n",
- "1. Read and decompress 232 gzipped files (about 100 GB decompressed)\n",
- "2. Send to the GPU and parse\n",
- "3. Filter down to precipitation records\n",
- "4. Apply a conversion to inches\n",
- "5. Sum total inches of rain per year per each of the 108k weather stations\n",
- "6. Combine and pull results into a single-GPU DataFrame on the client host\n",
- "\n",
- "In other words, this will take time."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%time precip_year_df = precip_year_ddf.compute()\n",
- "\n",
- "# Convert from the groupby multi-indexed DataFrame back to a normal DF which we can use with merge\n",
- "precip_year_df = precip_year_df.reset_index()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 5. Using Inner Joins to Filter Weather Observations\n",
- "\n",
- "We have separate DataFrames containing Atlanta and Seattle stations, and we have our total precipitation grouped by `station_id` and `year`. Computing inner joins can let us compute total precipitation by year for just Atlanta and Seattle."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%time atlanta_precip_df = precip_year_df.merge(atlanta_stations_df, on=['station_id'], how='inner')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "atlanta_precip_df.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%time seattle_precip_df = precip_year_df.merge(seattle_stations_df, on=['station_id'], how='inner')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "seattle_precip_df.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Lastly, we need to normalize the total amount of rain in each city by the number of stations which collected rainfall: Seattle had twice as many stations collecting, but that doesn't mean more total rain fell! "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "atlanta_rain = atlanta_precip_df.groupby(['year']).val.sum()/len(atlanta_stations_df)\n",
- "atlanta_rain.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "seattle_rain = seattle_precip_df.groupby(['year']).val.sum()/len(seattle_stations_df)\n",
- "\n",
- "seattle_rain.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Visualizing the Answer\n",
- "\n",
- "To generate the graphs in the cells below, first you'll need to ```conda install -y python-graphviz matplotlib```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "import matplotlib.pyplot as plt\n",
- "from matplotlib.pyplot import *\n",
- "\n",
- "plt.close('all')\n",
- "plt.rcParams['figure.figsize'] = [20, 10]\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "\n",
- "atlanta_rain.to_pandas().plot(ax=ax)\n",
- "seattle_rain.to_pandas().plot(ax=ax)\n",
- "\n",
- "ax.legend(['Atlanta', 'Seattle'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Results\n",
- "\n",
- "It looks like I'm right (mostly)! At least for roughly the last 80 years, it rains more by volume in Atlanta than it does in Seattle. The data seems to confirm my suspicions.\n",
- "\n",
- "But as usual the answer raises additional questions:\n",
- "\n",
- "1. Without singling out Atlanta and Seattle, which city actually has the most precipitation by volume?\n",
- "\n",
- "2. Why is there such a large increase in observed precipitation in the last 10 years?\n",
- "\n",
- "3. One friend noted that it rains more frequently in Seattle, just not as hard. A contrarian was quick to point out that it mists a lot in Seattle. How often is it just \"misty\", but not really raining?\n",
- "\n",
- "We'll revisit these questions in a future post, and look forward to seeing what kinds of analyses YOU come up with."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Takeaways\n",
- "\n",
- "We just showed some of the ways you can use Dask and cuDF to parallelize typical data processing tasks on multiple GPUs. Hopefully this notebook provides useful examples to refer to while doing your own ETL & analytics work.\n",
- "\n",
- "For more info on what's working today with Dask and cuDF, see [our summary](https://docs.rapids.ai/api/cudf/stable/), and follow [our ongoing development](https://github.com/rapidsai/cudf).\n",
- "\n",
- "Also check out other [community contributed notebooks](https://github.com/rapidsai/notebooks-contrib), and submit your own!"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/the_archive/archived_competition_notebooks/kaggle/README.md b/the_archive/archived_competition_notebooks/kaggle/README.md
new file mode 100644
index 00000000..28091c87
--- /dev/null
+++ b/the_archive/archived_competition_notebooks/kaggle/README.md
@@ -0,0 +1,9 @@
+## Open GPU Data Science
+
+# Introduction
+This repo contains RAPIDS solutions for Kaggle competitions and other real-world, end-to-end (E2E) problems.
+1. plasticc: 8th-place [Rapids.ai](https://rapids.ai) solution to [PLAsTiCC Astronomical Classification](https://www.kaggle.com/c/PLAsTiCC-2018).
+2. malware: exploratory analysis of [Microsoft Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction).
+
+# Build and Run with Docker or bare-metal
+Please find the README file in each folder.
diff --git a/conference_notebooks/KDD_2019/img/rapids_logo.png b/the_archive/archived_competition_notebooks/kaggle/img/rapids_logo.png
similarity index 100%
rename from conference_notebooks/KDD_2019/img/rapids_logo.png
rename to the_archive/archived_competition_notebooks/kaggle/img/rapids_logo.png
diff --git a/the_archive/archived_competition_notebooks/kaggle/img/solution.png b/the_archive/archived_competition_notebooks/kaggle/img/solution.png
new file mode 100644
index 00000000..2722ca3d
Binary files /dev/null and b/the_archive/archived_competition_notebooks/kaggle/img/solution.png differ
diff --git a/the_archive/archived_competition_notebooks/kaggle/landmark/cudf_stratifiedKfold_1000x_speedup.ipynb b/the_archive/archived_competition_notebooks/kaggle/landmark/cudf_stratifiedKfold_1000x_speedup.ipynb
new file mode 100644
index 00000000..7a270ede
--- /dev/null
+++ b/the_archive/archived_competition_notebooks/kaggle/landmark/cudf_stratifiedKfold_1000x_speedup.ipynb
@@ -0,0 +1,773 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cudf version 0.7.2+0.g3ebd286.dirty\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import cudf as gd\n",
+ "import time\n",
+ "from sklearn.model_selection import StratifiedKFold\n",
+ "import numpy as np\n",
+ "import warnings\n",
+ "from numba import cuda\n",
+ "warnings.filterwarnings(\"ignore\")\n",
+ "print('cudf version',gd.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### I didn't expect to use RAPIDS at all for the __[Google Landmark Recognition 2019](https://www.kaggle.com/c/landmark-recognition-2019)__ competition, but it turned out that the stratified k-fold operation could be painfully slow. In this notebook, a cuDF-based implementation is shown to achieve a 1000x+ speedup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Table of contents\n",
+ "[1. Implementation of cudf based stratified kfold split](#imp) \n",
+ "[2. Sanity Check with toy data](#san) \n",
+ "[3. The google landmark dataset](#land) \n",
+ "[4. Measure the runtime](#runtime) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Implementation of cudf based stratified kfold split\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class StratifiedKFold_cudf_gpu:\n",
+ " \"\"\"Stratified K-Folds cross-validator using cudf on gpu.\n",
+ " Functionality is the same as \n",
+ " https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html\n",
+ " Parameters\n",
+ " ----------\n",
+ " n_splits : int, default=3\n",
+ " Number of folds. Must be at least 2.\n",
+ " .. versionchanged:: 0.20\n",
+ " ``n_splits`` default value will change from 3 to 5 in v0.22.\n",
+ " shuffle : boolean, optional\n",
+ " Whether to shuffle each stratification of the data before splitting\n",
+ " into batches.\n",
+ " random_state : int, default=42, RandomState instance, which is the seed used \n",
+ " by the random number generator;\n",
+ " tpb: int, default=32, number of threads per thread block. A thread block is a group of threads \n",
+ " to process the group of samples with same value of y. If the number of unique values of \n",
+ " y is small,the group size is large and tpb should increase accordingly. The largest value\n",
+ " of tpb is 1024 and it should be multiples of 32.\n",
+ " mode: str, default = 'relax', how to deal with class with fewer samples than n_splits\n",
+ " The possible options are 'relax' and 'sklearn'. \n",
+ " With 'sklearn' mode, it will assert that n_splits must be less or equal to the number of samples\n",
+ " in the smallest class.\n",
+ " With 'relax', class with fewer samples than n_splits will be only in either train or valid part\n",
+ " of a given fold.\n",
+ " Examples\n",
+ " --------\n",
+ " >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])\n",
+ " >>> y = np.array([0, 0, 1, 1])\n",
+ " >>> skf = StratifiedKFold_cudf_gpu(n_splits=2,random_state=None, shuffle=False)\n",
+ " >>> for train_index, test_index in skf.split(X, y):\n",
+ " ... print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
+ " ... X_train, X_test = X[train_index], X[test_index]\n",
+ " ... y_train, y_test = y[train_index], y[test_index]\n",
+ " TRAIN: [1 3] TEST: [0 2]\n",
+ " TRAIN: [0 2] TEST: [1 3]\n",
+ " Notes\n",
+ " -----\n",
+ " Train and test sizes may be different in each fold, with a difference of at most ``n_classes``.\n",
+ " \"\"\"\n",
+ " def __init__(self,n_splits=3,shuffle=True,random_state=42,tpb=32,mode='relax'):\n",
+ " self.n_splits = n_splits\n",
+ " self.shuffle = shuffle\n",
+ " self.seed = random_state\n",
+ " self.tpb = tpb # threads per thread block\n",
+ " self.mode = mode\n",
+ " \n",
+ " def get_n_splits(self, X=None, y=None):\n",
+ " return self.n_splits\n",
+ " \n",
+ " def split(self,x,y):\n",
+ " \"\"\"Generate indices to split data into training and test set.\n",
+ " Parameters\n",
+ " ----------\n",
+ " X : array-like, shape (n_samples, n_features)\n",
+ " y : array-like, shape (n_samples,)\n",
+ " The target variable for supervised learning problems.\n",
+ " Stratification is done based on the y labels.\n",
+ " Yields\n",
+ " ------\n",
+ " train : ndarray\n",
+ " The training set indices for that split.\n",
+ " test : ndarray\n",
+ " The testing set indices for that split.\n",
+ " Notes\n",
+ " -----\n",
+ " Randomized CV splitters may return different results for each call of\n",
+ " split. You can make the results identical by setting ``random_state``\n",
+ " to an integer.\n",
+ " \"\"\"\n",
+ " assert x.shape[0] == y.shape[0]\n",
+ " df = gd.DataFrame()\n",
+ " x = np.array(x)\n",
+ " y = np.array(y)\n",
+ " ids = np.arange(x.shape[0])\n",
+ " \n",
+ " if self.shuffle:\n",
+ " np.random.seed(self.seed)\n",
+ " np.random.shuffle(ids)\n",
+ " x = x[ids]\n",
+ " y = y[ids]\n",
+ " \n",
+ " cols = []\n",
+ " df['y'] = np.ascontiguousarray(y)\n",
+ " df['ids'] = ids\n",
+ " \n",
+ " grpby = df.groupby(['y'], method=\"cudf\")\n",
+ " if self.mode == 'sklearn':\n",
+ " dg = grpby.agg({'y':'count'})\n",
+ " #print(dg.columns)\n",
+ " col = dg.columns[0]\n",
+ " msg = 'n_splits=%d cannot be greater than the number of members in each class.'%self.n_splits\n",
+ " assert dg[col].min()>=self.n_splits,msg\n",
+ "\n",
+ " def get_order_in_group(y,ids,order):\n",
+ " for i in range(cuda.threadIdx.x, len(y), cuda.blockDim.x):\n",
+ " order[i] = i\n",
+ "\n",
+ " got = grpby.apply_grouped(get_order_in_group,incols=['y','ids'],\n",
+ " outcols={'order': np.int32},\n",
+ " tpb=self.tpb)\n",
+ "\n",
+ " got = got.sort_values('ids')\n",
+ " \n",
+ " dx = got.to_pandas()\n",
+ " del got,df\n",
+ " \n",
+ " for i in range(self.n_splits):\n",
+ " mask = dx['order']%self.n_splits==i\n",
+ " train = dx.loc[~mask,'ids'].values\n",
+ " test = dx.loc[mask,'ids'].values\n",
+ " if len(test)==0:\n",
+ " break\n",
+ " yield train,test "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Sanity check\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TRAIN: [1 3] TEST: [0 2]\n",
+ "TRAIN: [0 2] TEST: [1 3]\n"
+ ]
+ }
+ ],
+ "source": [
+ "X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])\n",
+ "y = np.array([0, 0, 1, 1])\n",
+ "skf = StratifiedKFold_cudf_gpu(n_splits=2,random_state=None, shuffle=False)\n",
+ "for train_index, test_index in skf.split(X, y):\n",
+ " print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
+ " X_train, X_test = X[train_index], X[test_index]\n",
+ " y_train, y_test = y[train_index], y[test_index]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### We compare the `relax` and `sklearn` modes of StratifiedKFold_cudf_gpu"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TRAIN: [1 3] TEST: [0 2]\n",
+ "TRAIN: [0 2] TEST: [1 3]\n"
+ ]
+ }
+ ],
+ "source": [
+ "skf = StratifiedKFold_cudf_gpu(n_splits=4,random_state=None, shuffle=False,mode='relax')\n",
+ "for train_index, test_index in skf.split(X, y):\n",
+ " print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
+ " X_train, X_test = X[train_index], X[test_index]\n",
+ " y_train, y_test = y[train_index], y[test_index]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This example shows that when the number of samples is too small for a given `n_splits`, the `relax` mode stops making more splits without reporting an error. Please refer to [the Google Landmark dataset example](#land) for more analysis of how `relax` mode behaves."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### The following example intentionally raises an error, as it would with both sklearn and cuDF. The `sklearn` mode of the cuDF version is designed to catch the same error as the sklearn version."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "AssertionError",
+ "evalue": "n_splits=4 cannot be greater than the number of members in each class.",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mskf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStratifiedKFold_cudf_gpu\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_splits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mrandom_state\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mshuffle\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'sklearn'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mskf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"TRAIN:\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"TEST:\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0my\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36msplit\u001b[0;34m(self, x, y)\u001b[0m\n\u001b[1;32m 91\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 92\u001b[0m \u001b[0mmsg\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'n_splits=%d cannot be greater than the number of members in each class.'\u001b[0m\u001b[0;34m%\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_splits\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 93\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mdg\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m>=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_splits\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 94\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 95\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget_order_in_group\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mids\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mAssertionError\u001b[0m: n_splits=4 cannot be greater than the number of members in each class."
+ ]
+ }
+ ],
+ "source": [
+ "skf = StratifiedKFold_cudf_gpu(n_splits=4,random_state=None, shuffle=False,mode='sklearn')\n",
+ "for train_index, test_index in skf.split(X, y):\n",
+ "    print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
+ "    X_train, X_test = X[train_index], X[test_index]\n",
+ "    y_train, y_test = y[train_index], y[test_index]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "ValueError",
+ "evalue": "n_splits=4 cannot be greater than the number of members in each class.",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mskf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStratifiedKFold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_splits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mrandom_state\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mshuffle\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mskf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"TRAIN:\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"TEST:\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0my\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m~/anaconda3/envs/cudf0.7/lib/python3.6/site-packages/sklearn/model_selection/_split.py\u001b[0m in \u001b[0;36msplit\u001b[0;34m(self, X, y, groups)\u001b[0m\n\u001b[1;32m 333\u001b[0m .format(self.n_splits, n_samples))\n\u001b[1;32m 334\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 335\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgroups\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 336\u001b[0m \u001b[0;32myield\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 337\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m~/anaconda3/envs/cudf0.7/lib/python3.6/site-packages/sklearn/model_selection/_split.py\u001b[0m in \u001b[0;36msplit\u001b[0;34m(self, X, y, groups)\u001b[0m\n\u001b[1;32m 87\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgroups\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mindexable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgroups\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 88\u001b[0m \u001b[0mindices\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_num_samples\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 89\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_iter_test_masks\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgroups\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 90\u001b[0m \u001b[0mtrain_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mindices\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogical_not\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 91\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mindices\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtest_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m~/anaconda3/envs/cudf0.7/lib/python3.6/site-packages/sklearn/model_selection/_split.py\u001b[0m in \u001b[0;36m_iter_test_masks\u001b[0;34m(self, X, y, groups)\u001b[0m\n\u001b[1;32m 684\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 685\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_iter_test_masks\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgroups\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 686\u001b[0;31m \u001b[0mtest_folds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_test_folds\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 687\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_splits\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 688\u001b[0m \u001b[0;32myield\u001b[0m \u001b[0mtest_folds\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m~/anaconda3/envs/cudf0.7/lib/python3.6/site-packages/sklearn/model_selection/_split.py\u001b[0m in \u001b[0;36m_make_test_folds\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 649\u001b[0m raise ValueError(\"n_splits=%d cannot be greater than the\"\n\u001b[1;32m 650\u001b[0m \u001b[0;34m\" number of members in each class.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 651\u001b[0;31m % (self.n_splits))\n\u001b[0m\u001b[1;32m 652\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_splits\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0mmin_groups\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 653\u001b[0m warnings.warn((\"The least populated class in y has only %d\"\n",
+ "\u001b[0;31mValueError\u001b[0m: n_splits=4 cannot be greater than the number of members in each class."
+ ]
+ }
+ ],
+ "source": [
+ "skf = StratifiedKFold(n_splits=4,random_state=None, shuffle=False)\n",
+ "for train_index, test_index in skf.split(X, y):\n",
+ "    print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
+ "    X_train, X_test = X[train_index], X[test_index]\n",
+ "    y_train, y_test = y[train_index], y[test_index]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
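+ "#### The examples above show that the `sklearn` mode of the cudf version rejects the same invalid input as sklearn's stratified k-fold split and reports the same message, although it raises an `AssertionError` rather than a `ValueError`."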
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### A real-world example with the Google Landmark dataset\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Please download the train.csv from https://s3.amazonaws.com/google-landmark/metadata/train.csv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 472 ms, sys: 160 ms, total: 632 ms\n",
+ "Wall time: 633 ms\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "path = 'train.csv'\n",
+ "cols = ['id','url','landmark_id']\n",
+ "dtypes = ['str','str','int32']\n",
+ "train = gd.read_csv(path,names=cols,dtype=dtypes,skiprows=1) # skip the header"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>id</th>\n",
+ "      <th>url</th>\n",
+ "      <th>landmark_id</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>6e158a47eb2ca3f6</td>\n",
+ "      <td>https://upload.wikimedia.org/wikipedia/commons...</td>\n",
+ "      <td>142820</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>202cd79556f30760</td>\n",
+ "      <td>http://upload.wikimedia.org/wikipedia/commons/...</td>\n",
+ "      <td>104169</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>3ad87684c99c06e1</td>\n",
+ "      <td>http://upload.wikimedia.org/wikipedia/commons/...</td>\n",
+ "      <td>37914</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>e7f70e9c61e66af3</td>\n",
+ "      <td>https://upload.wikimedia.org/wikipedia/commons...</td>\n",
+ "      <td>102140</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>4072182eddd0100e</td>\n",
+ "      <td>https://upload.wikimedia.org/wikipedia/commons...</td>\n",
+ "      <td>2474</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " id url \\\n",
+ "0 6e158a47eb2ca3f6 https://upload.wikimedia.org/wikipedia/commons... \n",
+ "1 202cd79556f30760 http://upload.wikimedia.org/wikipedia/commons/... \n",
+ "2 3ad87684c99c06e1 http://upload.wikimedia.org/wikipedia/commons/... \n",
+ "3 e7f70e9c61e66af3 https://upload.wikimedia.org/wikipedia/commons... \n",
+ "4 4072182eddd0100e https://upload.wikimedia.org/wikipedia/commons... \n",
+ "\n",
+ " landmark_id \n",
+ "0 142820 \n",
+ "1 104169 \n",
+ "2 37914 \n",
+ "3 102140 \n",
+ "4 2474 "
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = train.head().to_pandas()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "number of samples 4132914 number of classes 203094\n"
+ ]
+ }
+ ],
+ "source": [
+ "y = train['landmark_id'].to_pandas().values\n",
+ "print('number of samples %d number of classes %d'%(y.shape[0],np.unique(y).shape[0]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[samples, classes] in each fold:\n",
+ "train [3610971,184200] valid [521943,203094]\n",
+ "train [3641533,203094] valid [491381,184200]\n",
+ "train [3670149,203094] valid [462765,166463]\n",
+ "train [3695903,203094] valid [437011,150659]\n",
+ "train [3718707,203094] valid [414207,137133]\n",
+ "train [3738707,203094] valid [394207,125731]\n",
+ "train [3756907,203094] valid [376007,115755]\n",
+ "train [3773431,203094] valid [359483,106996]\n",
+ "train [3788254,203094] valid [344660,99342]\n",
+ "train [3801664,203094] valid [331250,92740]\n",
+ "CPU times: user 1min 29s, sys: 5.24 s, total: 1min 34s\n",
+ "Wall time: 4.26 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "skf = StratifiedKFold_cudf_gpu(n_splits=10,random_state=42, shuffle=True, mode='relax')\n",
+ "print('[samples, classes] in each fold:')\n",
+ "for train_index, test_index in skf.split(y, y):\n",
+ "    print('train [%d,%d] valid [%d,%d]'%(y[train_index].shape[0],\n",
+ "                                       np.unique(y[train_index]).shape[0],\n",
+ "                                       y[test_index].shape[0],\n",
+ "                                       np.unique(y[test_index]).shape[0],))\n",
+ "    assert y[train_index].shape[0]+y[test_index].shape[0] == y.shape[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Unlike sklearn's version, where n_splits cannot be greater than the number of members in the smallest class, we use an approximate approach that allows the samples of a minority class to appear only in the train part or only in the valid part. The downside is that the folds are not equal in size: the number of samples in valid decreases monotonically from fold 0 to fold n-1, and the size difference between the largest and the smallest fold grows with n_splits. In my limited experience, this is acceptable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### I don't even want to run sklearn's version on this dataset since it runs forever. Please feel free to try."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"\\n%%time\\nskf = StratifiedKFold(n_splits=4,random_state=None, shuffle=False)\\nprint('[samples, classes] in each fold:')\\nfor train_index, test_index in skf.split(y, y):\\n print('train [%d,%d] valid [%d,%d]'%(y[train_index].shape[0],\\n np.unique(y[train_index]).shape[0],\\n y[test_index].shape[0],\\n np.unique(y[test_index]).shape[0],))\\n\""
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\"\"\n",
+ "%%time\n",
+ "skf = StratifiedKFold(n_splits=4,random_state=None, shuffle=False)\n",
+ "print('[samples, classes] in each fold:')\n",
+ "for train_index, test_index in skf.split(y, y):\n",
+ " print('train [%d,%d] valid [%d,%d]'%(y[train_index].shape[0],\n",
+ " np.unique(y[train_index]).shape[0],\n",
+ " y[test_index].shape[0],\n",
+ " np.unique(y[test_index]).shape[0],))\n",
+ "\"\"\" "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Measure the run time of sklearn's stratified kfold using random data\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def stratified_kfold_timing(n_splits,classes,samples,gpu=False):\n",
+ "    \"\"\"measure the run time of stratified kfold split using random synthetic data.\n",
+ "    Parameters\n",
+ "    ----------\n",
+ "    n_splits : int, \n",
+ "        number of splits\n",
+ "    classes : int, \n",
+ "        number of classes for the synthetic data\n",
+ "    samples: int, \n",
+ "        number of samples for the synthetic data\n",
+ "    gpu: boolean, default False,\n",
+ "        use gpu based stratified split or not\n",
+ "    \n",
+ "    Returns\n",
+ "    ------\n",
+ "    samples: int, \n",
+ "        number of samples for the synthetic data\n",
+ "    classes : int, \n",
+ "        number of classes for the synthetic data\n",
+ "    duration: float,\n",
+ "        run time of the stratified kfold split operation\n",
+ "    All three values are None if the split raises an error.\n",
+ "    \"\"\"\n",
+ "    y = np.random.randint(0,classes,samples)\n",
+ "    start = time.time()\n",
+ "    if gpu:\n",
+ "        skf = StratifiedKFold_cudf_gpu(n_splits=n_splits)\n",
+ "    else:\n",
+ "        skf = StratifiedKFold(n_splits=n_splits)\n",
+ "    try:\n",
+ "        for train_index, test_index in skf.split(y, y):\n",
+ "            break # only measure the time of one fold\n",
+ "    except Exception: # e.g. n_splits is greater than the number of members in a class\n",
+ "        return None,None,None\n",
+ "    duration = time.time()-start\n",
+ "    return samples,classes,duration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "samples: 10000 classes: 10 time:0.0108 seconds\n",
+ "samples: 10000 classes: 100 time:0.0353 seconds\n",
+ "samples: 10000 classes: 1000 time:0.2877 seconds\n",
+ "samples: 100000 classes: 10 time:0.0643 seconds\n",
+ "samples: 100000 classes: 100 time:0.1900 seconds\n",
+ "samples: 100000 classes: 1000 time:1.2166 seconds\n",
+ "samples: 100000 classes: 10000 time:11.4666 seconds\n",
+ "samples: 1000000 classes: 10 time:0.6816 seconds\n",
+ "samples: 1000000 classes: 100 time:1.6286 seconds\n",
+ "samples: 1000000 classes: 1000 time:9.7290 seconds\n",
+ "samples: 1000000 classes: 10000 time:87.1364 seconds\n",
+ "samples: 1000000 classes: 100000 time:867.2177 seconds\n"
+ ]
+ }
+ ],
+ "source": [
+ "sklearn_split_time = []\n",
+ "for i in range(4,7):\n",
+ "    for j in range(1,6):\n",
+ "        samples,classes,t = stratified_kfold_timing(n_splits=10,classes=10**j,samples=10**i,gpu=False)\n",
+ "        if t is None:\n",
+ "            continue\n",
+ "        print('samples: %d classes: %d time:%.4f seconds'%(samples,classes,t))\n",
+ "        sklearn_split_time.append([samples,classes,t])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Measure the run time of cudf's stratified kfold using random data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "samples: 10000 classes: 10 time:0.2129 seconds\n",
+ "samples: 10000 classes: 100 time:0.2013 seconds\n",
+ "samples: 10000 classes: 1000 time:0.1998 seconds\n",
+ "samples: 10000 classes: 10000 time:0.2702 seconds\n",
+ "samples: 10000 classes: 100000 time:0.1997 seconds\n",
+ "samples: 100000 classes: 10 time:0.2067 seconds\n",
+ "samples: 100000 classes: 100 time:0.2078 seconds\n",
+ "samples: 100000 classes: 1000 time:0.2079 seconds\n",
+ "samples: 100000 classes: 10000 time:0.2139 seconds\n",
+ "samples: 100000 classes: 100000 time:0.2746 seconds\n",
+ "samples: 1000000 classes: 10 time:0.3496 seconds\n",
+ "samples: 1000000 classes: 100 time:0.3753 seconds\n",
+ "samples: 1000000 classes: 1000 time:0.3382 seconds\n",
+ "samples: 1000000 classes: 10000 time:0.3291 seconds\n",
+ "samples: 1000000 classes: 100000 time:0.3089 seconds\n"
+ ]
+ }
+ ],
+ "source": [
+ "cudf_split_time = []\n",
+ "for i in range(4,7):\n",
+ "    for j in range(1,6):\n",
+ "        samples,classes,t = stratified_kfold_timing(n_splits=10,classes=10**j,samples=10**i,gpu=True)\n",
+ "        if t is None:\n",
+ "            continue\n",
+ "        print('samples: %d classes: %d time:%.4f seconds'%(samples,classes,t))\n",
+ "        cudf_split_time.append([samples,classes,t])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "<Figure size 1080x360 with 2 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.figure(figsize=(15,5))\n",
+ "colors = ['b','g','r']\n",
+ "\n",
+ "plt.subplot(1,2,1)\n",
+ "seq = {}\n",
+ "for samples,classes,t in sklearn_split_time:\n",
+ "    if samples not in seq:\n",
+ "        seq[samples] = [[],[]]\n",
+ "    seq[samples][0].append(classes)\n",
+ "    seq[samples][1].append(t)\n",
+ "plt.yscale('log')\n",
+ "plt.xlim(5,5*10**5)\n",
+ "plt.ylim(10**(-3),10**3)\n",
+ "plt.xscale('log')\n",
+ "plt.xlabel('number of classes')\n",
+ "plt.ylabel('run time: seconds')\n",
+ "plt.grid()\n",
+ "for samples,color in zip(seq,colors):\n",
+ "    plt.scatter(seq[samples][0],seq[samples][1],c=color,label='%d samples'%samples) \n",
+ "    plt.plot(seq[samples][0],seq[samples][1],c=color)\n",
+ "    plt.legend(loc='upper left')\n",
+ "    plt.title('sklearn stratified split')\n",
+ " \n",
+ "plt.subplot(1,2,2)\n",
+ "seq = {}\n",
+ "for samples,classes,t in cudf_split_time:\n",
+ "    if samples not in seq:\n",
+ "        seq[samples] = [[],[]]\n",
+ "    seq[samples][0].append(classes)\n",
+ "    seq[samples][1].append(t)\n",
+ "plt.yscale('log')\n",
+ "plt.xscale('log')\n",
+ "plt.xlim(5,5*10**5)\n",
+ "plt.ylim(10**(-2),10**1)\n",
+ "plt.xlabel('number of classes')\n",
+ "plt.ylabel('run time: seconds')\n",
+ "plt.grid()\n",
+ "for samples,color in zip(seq,colors):\n",
+ "    plt.scatter(seq[samples][0],seq[samples][1],c=color,label='%d samples'%samples) \n",
+ "    plt.plot(seq[samples][0],seq[samples][1],c=color)\n",
+ "    plt.legend(loc='upper left')\n",
+ "    plt.title('cudf stratified split')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The real landmark data has more than 200K classes and 4 million samples, so, extrapolating from the timings above, RAPIDS can be expected to deliver more than a **1000x speedup** over sklearn on it."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/blog_notebooks/plasticc/notebooks/cudf_workaround.py b/the_archive/archived_competition_notebooks/kaggle/malware/cudf_workaround.py
similarity index 100%
rename from blog_notebooks/plasticc/notebooks/cudf_workaround.py
rename to the_archive/archived_competition_notebooks/kaggle/malware/cudf_workaround.py
diff --git a/the_archive/archived_competition_notebooks/kaggle/malware/draw.py b/the_archive/archived_competition_notebooks/kaggle/malware/draw.py
new file mode 100644
index 00000000..f29a8200
--- /dev/null
+++ b/the_archive/archived_competition_notebooks/kaggle/malware/draw.py
@@ -0,0 +1,19 @@
+import pandas as pd
+import seaborn as sns
+
+
+def pie_chart(data,tags,title=None,transpose=False,figsize=(16,8)):
+    sns.set()
+    dic = {}
+    values = set()
+    for i in data:
+        for k in i:
+            values.add(k)
+    values = list(values)
+    for i,tag in zip(data,tags):
+        dic[tag] = [i.get(k,0) for k in values]
+    df = pd.DataFrame(dic, index=values)
+    if transpose:
+        df = df.transpose()
+    df.plot(kind='pie', subplots=True, figsize=figsize,title=title)
+
diff --git a/the_archive/archived_competition_notebooks/kaggle/malware/malware_time_column_explore.ipynb b/the_archive/archived_competition_notebooks/kaggle/malware/malware_time_column_explore.ipynb
new file mode 100644
index 00000000..f3c066c6
--- /dev/null
+++ b/the_archive/archived_competition_notebooks/kaggle/malware/malware_time_column_explore.ipynb
@@ -0,0 +1,742 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "GPU_id = 0\n",
+ "os.environ['CUDA_VISIBLE_DEVICES'] = str(GPU_id)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import cudf as gd\n",
+ "import numpy as np\n",
+ "from collections import OrderedDict,Counter\n",
+ "import re\n",
+ "from librmm_cffi import librmm\n",
+ "import nvstrings\n",
+ "import time\n",
+ "import draw\n",
+ "from termcolor import colored\n",
+ "from nvstring_workaround import get_unique_tokens,on_gpu,get_token_counts,is_in\n",
+ "import warnings\n",
+ "\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "PATH = '/raid/data/ml/malware/input'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**The purpose of this notebook is to study the difference between the train and test datasets in order to develop a robust validation scheme. This matters for how well models generalize to unseen data (the test data behind the private leaderboard).** I'm also trying to use cudf and nvstrings as much as possible."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Table of contents\n",
+ "[1. Previous CV schemes](#prev) \n",
+ "[2. Functions](#func) \n",
+ "[3. Visualizations](#vis) \n",
+ "[4. Conclusions](#conclusions) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Previous CV schemes\n",
+ "\n",
+ "Previously, I used the naive __[K-Fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)__ with random shuffling and observed some discrepancies between cross validation AUC (CV) and leaderboard AUC (LB) as follows: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
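+ "<div>\n",
+ "\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>CV</th>\n",
+ "      <th>LB</th>\n",
+ "      <th>description</th>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>models</th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>lgb1</th>\n",
+ "      <td>0.730</td>\n",
+ "      <td>0.675</td>\n",
+ "      <td>lightGBM</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>lgb2</th>\n",
+ "      <td>0.732</td>\n",
+ "      <td>0.672</td>\n",
+ "      <td>lightGBM with mean target features</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>ffm</th>\n",
+ "      <td>0.727</td>\n",
+ "      <td>0.680</td>\n",
+ "      <td>Field aware factorization machine</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>nn</th>\n",
+ "      <td>0.729</td>\n",
+ "      <td>0.678</td>\n",
+ "      <td>Neural network</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"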
+ ],
+ "text/plain": [
+ " CV LB description\n",
+ "models \n",
+ "lgb1 0.730 0.675 lightGBM\n",
+ "lgb2 0.732 0.672 lightGBM with mean target features\n",
+ "ffm 0.727 0.680 Field aware factorization machine\n",
+ "nn 0.729 0.678 Neural network"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "scores = pd.DataFrame({'models':['lgb1','lgb2','ffm','nn'],'CV':[0.730,0.732,0.727,0.729],'LB':[0.675,0.672,0.680,0.678]})\n",
+ "scores['description'] = ['lightGBM','lightGBM with mean target features','Field aware factorization machine', 'Neural network']\n",
+ "scores = scores.set_index('models')\n",
+ "scores"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Two points to be noted from the table above:\n",
+ "1. **CV AUC is higher than LB**, which means there is some overfitting.\n",
+ "2. **Improvement of CV doesn't lead to improvement of LB**, which means the split between the train and test datasets is *not* the same as the random K-Fold split.\n",
+ "\n",
+ "Actually, the dataset provided here has been roughly __[split by time](https://www.kaggle.com/c/microsoft-malware-prediction/data)__, and a random split of time series data is __[not a good idea](https://www.datapred.com/blog/advanced-cross-validation-tips)__. \n",
+ "\n",
+ "I believe a time-based split will 1) reduce the gap between CV and LB and, more importantly, 2) align improvements in CV with improvements in LB, so that we can evaluate ETL and feature/model selection locally without submitting to Kaggle."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What's annoying is the dataset **doesn't include an explicit timestamp column**, so we have to infer the timing information from other columns. My intuitive assumption is that the version number of the defender software contains timing information: higher version number means more recent observations. Hence, in this notebook, I'll study these 4 columns:\n",
+ "1. *ProductName* - Defender state information e.g. win8defender\n",
+ "2. *EngineVersion* - Defender state information e.g. 1.1.12603.0\n",
+ "3. *AppVersion* - Defender state information e.g. 4.9.10586.0\n",
+ "4. *AvSigVersion* - Defender state information e.g. 1.217.1014.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id=\"func\"></a>\n",
+ "### 2. Functions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def rmse(a,b):\n",
+ "    \"\"\"compute root mean square error of two numpy arrays\n",
+ "    \"\"\"\n",
+ "    return np.mean((a-b)**2)**0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_topk_token_count(nvs,k=5):\n",
+ "    \"\"\"get top-k token counts of a nvstring object\n",
+ "    \n",
+ "    Parameters\n",
+ "    ----------\n",
+ "    nvs : a nvstring object, \n",
+ "    k : integer, for top-k\n",
+ "    \n",
+ "    Returns\n",
+ "    ----------\n",
+ "    nvs_count : a dictionary (collections.Counter) token (str) => count (int)\n",
+ "                including all tokens of that nvstring\n",
+ "    nvs_count_topk : a plain dictionary token (str) => count (int)\n",
+ "                including top-k frequent tokens of that nvstring\n",
+ "    \"\"\"\n",
+ "    nvs_count = get_token_counts(nvs)\n",
+ "    nvs_count_topk = dict(nvs_count.most_common(k)) \n",
+ "    sum_top = sum([j for i,j in nvs_count_topk.items()])\n",
+ "    ratio = '%.4f'%(sum_top/nvs.size())\n",
+ "    ratio = colored(ratio,'red')\n",
+ "    print('# of unique values: %d, top%d %s, top%d percentage:'%(len(nvs_count),k,str(nvs_count_topk),k),ratio) \n",
+ "    return nvs_count,nvs_count_topk"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def overlap_piechart(train_count,test_count,title):\n",
+ "    \"\"\"draw a pie chart of overlapped ratio between two datasets\n",
+ "    Parameters\n",
+ "    ----------\n",
+ "    train_count : a dictionary (collections.Counter) token (str) => count (int)\n",
+ "                for train data\n",
+ "    test_count : a dictionary (collections.Counter) token (str) => count (int)\n",
+ "                for test data\n",
+ "    \n",
+ "    \"\"\"\n",
+ "    train_in_test = is_in(train_count,test_count)\n",
+ "    test_in_train = is_in(test_count,train_count)\n",
+ "    t1 = colored('%.3f'%train_in_test,'red')\n",
+ "    t2 = colored('%.3f'%test_in_train,'red')\n",
+ "    print('train_in_test ratio',t1,' test_in_train ratio',t2)\n",
+ "    data = []\n",
+ "    data.append({'in test':train_in_test,'not in test':1-train_in_test})\n",
+ "    data.append({'in train':test_in_train,'not in train':1-test_in_train})\n",
+ "    tags = ['train data','test data']\n",
+ "    draw.pie_chart(data,tags,title=title,figsize=(16,4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id=\"vis\"></a>\n",
+ "### 3. Visualizations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 5.09 s, sys: 1.22 s, total: 6.31 s\n",
+ "Wall time: 6.3 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "# peak gpu memory usage is 17GB!\n",
+ "cols = ['ProductName', 'EngineVersion', 'AppVersion', 'AvSigVersion']\n",
+ "train = gd.read_csv('%s/train.csv'%PATH,usecols=cols)\n",
+ "test = gd.read_csv('%s/test.csv'%PATH,usecols=cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**ProductName: name of the defender software**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "train Counter({'win8defender': 8826520, 'mse': 94873, 'mseprerelease': 53, 'scep': 22, 'windowsintune': 8, 'fep': 7})\n",
+ "test Counter({'win8defender': 7797245, 'mse': 55946, 'mseprerelease': 34, 'scep': 16, 'fep': 7, 'windowsintune': 5})\n",
+ "CPU times: user 1.96 s, sys: 232 ms, total: 2.19 s\n",
+ "Wall time: 2.24 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "col = 'ProductName'\n",
+ "train_count = get_token_counts(train[col].data)\n",
+ "print('train',train_count)\n",
+ "test_count = get_token_counts(test[col].data)\n",
+ "print('test',test_count)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**The following pie charts show that**:\n",
+ "1. there are 6 different product names in both train and test\n",
+ "2. most samples are from product *win8defender*\n",
+ "3. the distribution is similar in train and test"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "draw.pie_chart([train_count_topk,test_count_topk],['train','test'],title=col+' top 5',figsize=(16,4))\n",
+ "draw.pie_chart([train_count_topk,test_count_topk],['train','test'],title=col+' top 5',figsize=(18,2),transpose=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Observations**\n",
+ "1. The top-5 AvSigVersion values of train and test are completely different, with no overlap.\n",
+ "2. For example, `1.277.1102.0` can only be found in the test data, not in the train data. Similarly, `1.251.42.0` can only be found in the train data, not in the test data.\n",
+ "3. The top 5 of train range from 1.251.x to 1.275.x, while 4 of the top 5 of test are above 1.277.x. This may indicate that higher version numbers come from more recent observations.\n",
+ "4. We could use AvSigVersion as a timestamp to split the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "train_in_test ratio \u001b[31m0.966\u001b[0m test_in_train ratio \u001b[31m0.154\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "overlap_piechart(train_count,test_count,col)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**We also look at overall token counts instead of just top-5 tokens**\n",
+ "1. For the test data, only 15.4% of samples have an AvSigVersion that is also present in the train data. This indicates that most of the test data carry new AvSigVersions.\n",
+ "2. For the train data, 96.6% of samples have an AvSigVersion that is also present in the test data. This indicates that the 15.4% of test data (the red slice of the left pie chart) actually contain most of the AvSigVersions of train."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**A similar analysis is applied to EngineVersion and AppVersion**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "==============Train==============\n",
+ "# of unique values: 68, top5 {'1.1.15000.2': 265218, '1.1.15200.1': 212408, '1.1.14600.4': 160585, '1.1.14800.3': 136476, '1.1.15300.6': 120295}, top5 percentage: \u001b[31m0.1003\u001b[0m\n",
+ "==============Test==============\n",
+ "# of unique values: 69, top5 {'1.1.15300.6': 3101305, '1.1.15400.4': 2106236, '1.1.15200.1': 366085, '1.1.15100.1': 158036, '1.1.14600.4': 138514}, top5 percentage: \u001b[31m0.7475\u001b[0m\n",
+ "CPU times: user 708 ms, sys: 232 ms, total: 940 ms\n",
+ "Wall time: 941 ms\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "col = 'EngineVersion'\n",
+ "k = 5\n",
+ "print(\"==============Train==============\")\n",
+ "train_count,train_count_topk = get_topk_token_count(train[col].data,k)\n",
+ "print(\"==============Test==============\")\n",
+ "test_count,test_count_topk = get_topk_token_count(test[col].data,k)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAyoAAAEECAYAAADH8MCoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAgAElEQVR4nOzdeZxN9f/A8de9d/Z9NYPsfsguYwmhlMrImhQtJBWRLSqiUiqVCKVsSXwxFUMI2ZeQrexkG2PG7JuZufs5vz+GmzErhnvHvJ+Phwdz3p97zvsenM993/M5n49GVVUVIYQQQgghhHAgWnsnIIQQQgghhBA3kkJFCCGEEEII4XCkUBFCCCGEEEI4HClUhBBCCCGEEA5HChUhhBBCCCGEw5FCRQghhBBCCOFwpFARQggHtXz5curWrWvvNEqcxWKhdu3arFmzxt6pCCGEcGAaWUdFCCFuzjvvvMOKFSvybPfw8ODQoUMldhyDwUBmZiZBQUElts8//viDIUOGsGbNGmrWrJkn/sEHH7Bt2zY2bdqEVnvnvstKTEzEx8cHV1fXO3YMIYQQpZuTvRMQQojSKCwsjGnTpuXaVtIf7N3c3HBzcyvRfT788MMEBwcTERHB2LFjc8X0ej2rV6+mX79+t/xezGYzzs7ORbYLDg6+pf0LIYQoO2TolxBC3AJnZ2eCg4Nz/QoMDATghRdeYNy4cXzzzTe0bt2a5s2bM2bMGLKysmyvVxSFr776ipYtW9KkSRNGjBjBggULcg31unHo17WfDxw4QPfu3WnUqBE9evTg8OHDuXKLiopi6NChhIWF0axZM15++WVOnToFgJOTEz179mTlypWYTKZcr/v999/Jzs6mV69etm07duygd+/eNGzYkIceeoixY8eSlpZmi7/11lsMGDCABQsW8PDDD9OgQQPMZjP79u3j2WefpUmTJjzwwAN07dqVP//8E8h/6Fd8fDzDhg0jLCyMhg0b8sILL3D8+HFb/M8//6R27drs3r2b5557joYNGxIeHs6OHTtu/i9PCCFEqSCFihBC3AHr168nPT2dhQsX8tVXX7F161bmzJlji//444/89NNPtmFkDRs25Ntvvy1yv9cKnHHjxrF8+XICAgIYPnw4FosFgKSkJPr06UNAQACLFy9m2bJlVKtWjRdffJGUlBQAevXqRUZGBhs2bMi174iICNq1a0dISAgAO3fuZMiQIXTp0oXffvuNb775hqioKN58881crzt48CAHDx5k1qxZREZGoqoqr7/+Ok2aNCEyMpLly5fzxhtvFDjMS1VVBg0axMWLF5k9ezYRERH4+/vTv3//XEURwOTJk3njjTdYtWoV9erVY8SIEVy5cqXI8yaEEKL0kUJFCCFuwV9//UWTJk1y/Xr99ddt8QoVKjB27Fhq1KhBmzZtePLJJ9m9e7ctPn/+fF566SW6detG1apV6d+/P61bty7yuKqqMnbsWMLCwqhRowZDhw4lJiaGixcvArBkyRIqVqzIhx9+SO3atalevTrvvfce3t7erFq1CoD77ruP1q1bExERYdvv2bNnOXToEL1797Zt++abb+jXrx99+/alSpUqNGzYkM8++4y9e/dy+vRpWztnZ2cmT55MnTp1qFOnDpmZmWRmZtKhQweqVKlC1apV6dixI02bNs33Pe3cuZNjx44xZcoUHnjgAerUqcPnn3+OTqdj6dKludoOHTqUNm3aULVqVUaNGsWVK1c4evRokedNCCFE6SPPqAghxC1o2LAhkydPzrXt+udJ6tSpkytWrlw5du7cCcCVK1dISEigcePGudo0btyY9evXF3pcjUaTa9/lypUDIDk5merVq3PkyBGOHTtGkyZNcr3OYDAQFRVl+7l3794MHTqUqKgoqlSpQkREBBUqVKBt27a2NkePHuXo0aMsXLgwTx5RUVHUqlULgJo1a+Lu7m6LBQQE0KNHD/r370/L
li1p1qwZHTt2pGrVqvm+pzNnzhAUFET16tVt29zc3GjQoAH//vtvrrb3339/nveelJSU/8kSQghRqkmhIoQQt8DNzY0qVaoUGL/xgXKNRsONkyxqNJqbPq5Wq0Wn0+XZh6Iott9btmzJhAkT8rzW29vb9ueHH36YoKAgIiIiGDZsGJGRkbzwwgu5HqK/NiSrc+fOefZ1/Uxk1xcp13z66af069ePXbt2sWvXLqZPn87777+f6/mXW3H9eb3xvQshhLi3SKEihBB3mbe3N+XKlePQoUO0a9fOtv2ff/657X3Xr1+fFStWEBoaWujUv9ceqv/555+pVasWV65cyVNE1KtXjzNnzhRakBWmdu3a1K5dm5dffplx48YRERGRb6FSs2ZNkpKSOHfunO2uisFg4MiRI7z00ku3dGwhhBClnzyjIoQQt8BsNpOYmJjnV3GXpnr55Zf58ccfWbVqFRcuXGDBggXs2rXrlu6yXO/555/HarUyePBg9u/fz6VLl9i/fz9Tp07l4MGDudr26tWL1NRUJk2alOsh+muGDRvGhg0bmDx5MidOnCAqKopt27bx7rvv5pkx7Hrnzp1jypQpHDhwgJiYGNvD9vmt2wLQpk0b6tWrx6hRozh48CCnTp1izJgxWK3WXM/MCCGEKFvkjooQQtyC/fv306ZNmzzbr39gvjAvvfQSKSkpTJo0CZPJRPv27enfvz/ff//9beUVFBTEsmXL+OqrrxgyZAiZmZkEBwfTtGnTPGuXXHuo/toUxDdq1aoVP/zwAzNnzmTp0qWoqkqFChVo06ZNruFnN/L09OTcuXNERkaSmpqKv78/Dz/8MGPGjMm3vUajYdasWXzyyScMHDgQs9lMo0aNmD9/Pn5+frd1PoQQQpResjK9EEI4iHfffZdTp06xfPlye6cihBBC2J3cURFCCDuIj49n48aNtGjRAq1Wy5YtW1i5ciXjx4+3d2pCCCGEQ5A7KkIIYQdJSUmMGDGCU6dOYTQaqVy5Mi+88ALPPPOMvVMTQgghHIIUKkIIIYQQQgiHI7N+CSGEEEIIIRyOFCpCCCGEEEIIhyOFihBCCCGEEMLhSKEihBBCCCGEcDhSqAghhBBCCCEcjhQqQgghhBBCCIcjhYoQQgghhBDC4UihIoQQQgghhHA4UqgIIYQQQgghHI4UKkIIIYQQQgiHI4WKEEIIIYQQwuFIoSKEEEIIIYRwOFKoCCGEEEIIIRyOFCpCCCGEEEIIhyOFihBCCCGEEMLhSKEihBBCCCGEcDhSqAghhBBCCCEcjhQqQgghhBBCCIcjhYoQQgghhBDC4TjZOwFx7zObzURHR6PXG+ydisiHTqcjIMCfoKAgtFr57kIIIcoC6Zsdm/TNOTSqqqr2TkLc286dO4eTkyteXr5oNBp7pyOuo6oqVquFjIxUnJy0VKlSxd4pCSGEuAukb3Zc0jf/p+yWaOKu0esNciF0UBqNBicnZ/z9g8jKyrJ3OkIIIe4S6Zsdl/TN/5FCRdwVciF0bBqNFrm3KoQQZYv0zY5N+mYpVIQQQgghhBAOSB6mF3edu4crbq4l/0/PYLSgzzYW2W769Kls2bKJy5djWbw4gho1auZps3fvbmbNmsnZs2fo1etZ3nxzRL77MplMjBkzghMnTgCwfv1mWyw2NpZevbpSvXoN27aZM7/D19cPgMjI5SxatABVhQcfbMXIkWNsD8zdauya9PQ0PvhgPDExl3B2dua++yrxzjvv4e/vX+T5EUIIUfb4eDqhc3Et8f1aTUYysixFtnOEvnn79q3Mmzcbs9mMqqp07tyVvn1fsLWbP38Oa9b8BkB4+FO8/PLAYsWuN2HCOA4e3EdSUhKbN+/Ew8OjyHNTlkmhIu46N1cnnhq1ssT3+9uUrsUqVNq2bU/v3s/x2msDCmxToUJFxo6dwObNGzGZTAW202q19OnzIn5+fgwdOihP3MvLm59+Wppne2xsDPPmzWbhwiX4+voyYsQQ1q1bS6dO
nW85lpuG559/iaZNwwCYMWMq3347nXHj3i/y/AghhCh7dC6unJvUs8T3W33cr1CMQsUR+uaAgEC+/PJrgoODycy8Qr9+falXrx6NGz/AoUMH2LTpDxYvjgBgwIAXadLkAZo0aVpo7EZdunRl+PBRdOr0aJHnRMjQL1EGNW7chJCQ0ELbVKpUmVq1aqPT6Qpt5+TkRPPmLfDy8r6pHDZv3ki7du3x9/dHq9XStWsPNm7ccFux6/n6+tqKFIB69Rpy+fLlm8pRCCGEuFscoW+uX78BwcHBQE4xU6VKNVvfuXHjBjp16oybmxtubm506tTZ1v8WFrtRWFhzAgICbiqvskwKFSHuoKysTPr168tLL/Vh0aIfuTYbeFxcHKGh5W3tQkJCiY+Pu61YQRRFYfnyn3nooXYl9r6EEEKI0qqgvvl6Fy6c59ixI4SFNQcK6n/ji4yJ2yNDv4S4Q4KCgli1ah0BAQGkpKQwevRwvL196Nq1+13NY8qUyXh4eNCrV++7elwhhBDC0RSnb05KSmTMmJGMHv2u7Q6LsA+5oyLEHeLi4mK7vRsQEMDjj3fi8OG/AQgNDSUu7r+hWPHxcbZb3rcay8/06VOJjo7m448/K9Mr2wohhBBQeN8MkJKSwtChg3j++Zfo0OEx2/b8+9+QImPi9sgnFyHukJSUFCwWMwAGg54dO7ZRq1ZtAB5+uAPbtm0lNTUVRVFYuXK57YJ4q7EbzZo1g5MnT/D551NwcXG5C+9YCCGEcGyF9c3p6Wm8+eYgnn66N126dMv1ukceeYy1a1djMBgwGAysXbuaDh06FhkTt0ej5jcwT4gSdOzYcSpUqGL72d7TE0+Z8jlbt24mJSUZX18/fH19WbLkF0aMGMqrrw7i/vvr8vffhxg//t2rK8KqeHp6MW7cBFq2bMXy5b+QlJTIq6/mzCTSv//zJCQkkJqaQmBgEC1btmLcuAls2bKJOXO+Q6vVYrFYaN36IQYPHmp7CHDFil9YtGghAM2bt+Stt96+rdiJE8eZPXsWU6fO4Ny5s/Tp04vKlavg6poz3WSFChWZPHlKgeclNjaKevXq3sKZF0IIUdrc2Dfbe3piR+ibZ8yYyi+/RFC58n/npXfv5+jcuSsAc+Z8x++/rwHgySfDGTjwdVu7gmLbt29jx45tjBs3AYC33x7F8ePHSExMIDg4mOrVa/D1198WeF7Ket8shYq44268GArHVNYvhkIIUZZI31w6lPW+WYZ+CSGEEKJETJ48mUceeYTatWtz+vTpfNvs3LmTHj16UL9+fSZPnlzgvkwmEwMGDKBFixa0aNEiV+zSpUvUrVuXrl272n6lpqYCcOLECbp3707Xrl0JDw9n/PjxudbciIiI4LHHHuPRRx9l4sSJKIpSrNj1jEYj77//Ph07duSpp55i/PjxxT5HQojik1m/hBBCCFEiOnTowIsvvkjfvn0LbFOpUiUmTZrEunXrily0b8CAAfj7+9OvX788cW9vb1auzLt4cLVq1Vi2bBkuLi4oisKwYcNYunQpL774ItHR0cycOZPIyEj8/PwYOHAgq1atolu3boXGbvTFF1/g6urK+vXr0Wg0JCUlFe8ECSFuihQq4pZZLApGsxWtVoNOq8FJp8WqqJjMVgxGC9lGC9kGMyazlSvZJhQFW1vttV8aDVoNaDQa236vjUbUaDRct1kIIYSDCwsLK7JNlSo5w402bix8dXEnJydatWrFpUuXbioHNzc3258tFgsGg8E26+H69et59NFHbbM+9erVi+XLl9OtW7dCY9fLysoiMjKSbdu22fquoKCgm8rxTrJYLRitJrQaLU5aJ7QaLQaLgUxTNnqLAYPZQLbZABYzV4yZaDRatGjQaK/+rtGgue53rUZDTq+swtWfhbhbpFARRVIUFYMp50E4F2cdKRkGzsdmcPJCChcuZxCTkEmm3kS2wYJVyfvI08juFbFos4s+kAa0Gg06nRZnJy0uTlpcXXS4OOtwdro6SlGV
AkYIIUROwdCjRw8AOnXqxIABA2yFQ3x8PK+++ioXL16kXbt2PPPMMwBcvnyZChUq2PZRoUIF28rjhcWuFx0djZ+fHzNnzmTv3r14enoybNiwYhVpJcmqWDFYTGg1Gpx1TiRlpRKVdokzKRe4mB7D5SsJpBuvoDcb8n394Gp9ULJSinUsJ60OZ60TzjpnnHVOuOhccNY54aTR2YoYDVrpm0WJk0JF5GE0WVFRsVgUouKucDoqhbMx6TlFSWImFusdmn9BBUVVURQrZrOVG0sbJyctrs463Fx0uLk64eqcMwuWFC5CCFG2lCtXjm3bthEYGEhycjKDBg3C19eXXr16ARASEsLKlSvJzs5m9OjR/PHHH4SHh5fIsa1WK9HR0dStW5e3336bf/75h9dff50//vgDLy+vEjlGfixWCybFjBYNp5LOcSr5LFFpMUSnxxKflZTv6uoldmzFikWxorfknVnTSavDReeMh7M77s5uOGudUKVwESVEChWBqiiYTUY0Ti78G53Gn4djOXAygUsJmfZOLReLRcFiUcjSm23bXF2c8PJwxsvdGZ1OC6hyW1oIIe5xLi4uBAYGAhAYGMhTTz3FwYMHbYXKNR4eHnTq1InffvuN8PBwypcvT2xsrC0eGxtL+fLlAQqNXa98+fI4OTnRuXNnABo1aoS/vz/nz5+nQYMGJfo+s816nLXOXM5MYG/0QQ5ePsq51It3tCi5WdeKmOyrd260Gg1uTq55CxeNFumdxc2SQqWMUhUrqtmEqipkn/4L9xoPMDPyBJv2Rd/xYzf8v0Bc3d2KbniTTAYDF2Iz8fZwxtXFCVVV0WrzXhanT5/Kli2buHw5lsWLI6hRo2aeNnv37mbWrJmcPXuGXr2e5c03R+R/TJOJMWNGcOLECQDWr99si8XGxtKrV1eqV69h2zZz5nf4+vqxfftW5s2bjdlsRlVVOnfuSt++L9jazZ8/hzVrfgMgPPwpXn55YLFi15swYRwHD+4jKSmJzZt34uHhUeC5E0KI0iQ5ORkfHx+cnZ3R6/Vs3ryZ9u3bAzlDs0JCQnBxccFkMrFp0yZq1aoFwOOPP07fvn0ZMmQIfn5+/Pzzz7aCo7DY9QICAmjRogW7du2iTZs2nD9/nuTkZNuzN7fDqiiYrSYsqpW/Lx/nr0t/cyT+JFnmYgyfvk3331cVT1f3Et9vlknPxbh4vFw88XBxJ+cLxbyTzjpC33z69Ck+/vhDVFXBYrHQsGEjRo1627ZocmTkchYtWoCqwoMPtmLkyDG2558Ki13PaDQybdoU9u3bi6urK/XrN+Ddd2XWuIJIoVLGKEY9aDRkHt3OlcNbMcacBlQCO77MY82a3ZVCxdXdjXOTepb4fquP+5X0zGTSM41otBo8XHPutni6O4OKrWhp27Y9vXs/x2uvDShwXxUqVGTs2Als3lz4w55arZY+fV7Ez8+PoUMH5Yl7eXnz009L82wPCAjkyy+/Jjg4mMzMK/Tr15d69erRuPEDHDp0gE2b/mDx4ggABgx4kSZNHqBJk6aFxm7UpUtXhg8fRadOjxZ+4oQQooR8/PHHbNiwgaSkJPr374+fnx9r1qxh4MCBvPnmmzRo0ID9+/czcuRIMjMzUVWVNWvWMGnSJB566CGWLFlCQkICw4YNA6Bnz57Ex8eTkZFB27Zteeihh5g0aRIHDhxg+vTptkX72rdvz/PPPw/AwYMHmTt3LhqNBkVRaNasGYMHDwZyZhwbPHiw7ZmV1q1b06VLlyJjR44cYfr06cyZMweADz/8kLFjxzJ58mScnJz4/PPP8fHxueXzZjAb0Wg07L10iN//3cLZlKhb3tet8nR155llefux2xXRexZZZj1ZZj2abPBwcsfb1QsPZzdUsI2CcIS+uXLlKsyb9yPOzs4oisLYsWNYseJXevd+jtjYGObNm83ChUvw9fVlxIghrFu3lk6dOhcau9HMmV/j6urCzz9HotFoSE5OLsZZLLukUCkDVKsVVbFg
SYsnbfdKsk78iWrJ/R886+Qe/q9nOztlWPJURSVLbyZLb0aj0eDj6YK/jytajYbGjZsU+fpKlSoDsG3blkLbOTk50bx5i1zDBQDi4uI4d+4cVmvOjDPXz0IDUL9+AzIzMzl79iwGg4HQ0JyHNhs3ho0bN9CpU2fba558MpwVK37Fzc2Dn39elivWseOTREQsxdvb17bvqlWr4eSkIyysOSkpOesKnDlzhqCgYMqXL28bM5ySkkpSUiKQc9GG/4YSREREMGfOHFRVpW3btrz33nt5vhlKTU1lzJgxXLx4ERcXF6pUqcLEiRNtM+YIIcqe9957j/feey/P9msf8CFnZrDt27fn+/rnnnsu18+//vprvu06duxIx44d841dW1elIM8++yzPPvvsTcUaNGiQ6z1UqlSJn376qcBjFIdVsWJWLCRnp/LbqY38eXE/hnyeAbmXqCq2okWr0eDl4omvmzdOWicaNWpS5DMtt9s3F+XGGeOMRqPtS87NmzfSrl17/P39AejatQerV6+iU6fOhcaul52dze+/r2bVqnW2iR+uDWEU+ZNC5R6mmPSg0ZJ5fBcZ+9Zgir9QYFvDpVM4OTlRo6IvZ2PS716Sd4GqqqRnGknPNOLu5kSAjxuuLnf2n763tw+VK1dCr9fz6qsvo9HAY489Tt++L9ouTs7OLlSoUJHjx49y+vRJ3n//IyCnyHnggf9mjwkJCeXAgX1UrVqN5ORkQkPLXxcLYffuXdSsmfcWuclkIiEhAYAaNWqSmJhAenoafn5+tljNmjXQ6ZyIirqAxaIHKPZaAhqNhldeecW2ENvkyZP58ssv+eSTT0rwTAohxL1Fbzag0WjYFbWfdWe2EJUWY++U7EJRVTKMmWQYM3HROePn5o2Xi+ddOXZWVib9+vVFVdU8fXNiYiIjRw4lJuYSDz7Ymm7dckaAxMXF3dD/hhIfH1dk7HoxMZfw9fVj3rzvOXBgP+7uHrz22uBifYFaVkmhcg9STHoUk5HUHRFkHt2Gasp/asLcL7KiP3+Yrm1r8NWSg3c+STvRGyzEGDJxdtKiqurVWcbyf5bldnh6euDsXJ5p076lQYMGV2eeGY63tw9du3YHwNXVhaSkRD755ENee+0NgoOD892XRqNBp3O6OllA8WVkZNiGImg04O8fQFpaKn5+fraYk1POJcDfP4CLF88Aha8zcD0/P79cq0U3btyYJUuW3FSOQghRVujNBvRmA8uOrmLXxf2YrOaiX1RGmKxmErJSSNGn2/pmVVVzrbFWUoKCgli1ah0BAQGkpKTk6ZuDg4P56ael6PV6PvjgPbZu3cxjjz1eIse2Wq3ExFyiVq06DB06gqNHjzB69HB++WUlnp53bsa40uzmPvkIh6aYDFj1V0je9BMXZ7zGlYPri1ekXJV5fBdNa976GNvSxGxRsCoqMYlZJGcYsFpzCpaS5OLigo9PzpCsgIAAHn+8E4cP/22Lp6SkMHToILp370Xr1g/ZtoeGhhIX99/c/fHxcYSEhAA5t4hzx+Lx9/fn7NmznD17NtfqyCaTGRcXZ9vPzs7OmM3mAmNWqxUo/loC11MUhSVLlvDII48U48wIIUTZoTcbSMlOY+6BJQxePY4t53dLkVIAi2LFqlqJy8xZA+ZawVKSXFxcbF/E5dc3X+Pu7s6jj3Zk3bq1QEF9c2iRseuFhoai0znRseMTQM4wcF9fPy5evFhyb/AeI4XKPUAxG1GM2aRuX8bF6a9y5eB6UCw3vR/9uUN4eXng7eFcdON7hKqopF8xcuFyBqlXjFcviiWz75SUFCyWnL8Hg0HPjh3bqFWrNgDp6Wm8+eYgnn66t+2Cdc0jjzzG2rWrMRgMGAwG1q5dTYcOOWOxmzVrkSu2YcPv9OjRixo1alClShXS0zNITU0tmTdwEz766CM8PDxsD7MKIURZZzAbSTdcYd7BpQxePY4dUX+hqIq90yoVrIpCcnYaF9NiSDNk5Ix+KKHOOadvzikUb+yb
Y2Iu2R7SN5vNbN++1Ta0+uGHO7Bt21ZSU1NRFIWVK5fTocNjRcau5+fnT9OmYfz11x4ALl6MIjU1lfvuq1Qi7+1eJEO/SjHFbAJVIW13JOl//XZTd0/y3Z8hC1NiNE89VJ3/rT9VQlk6noXzZ7Bv7w7S01L4bOJbeHn7MHnqD4x9Zzi9nnuZlmFN+Pf0ESaMH0tWVhag8scf6xk3bgItW7Zi+fJfSEpK5NVXc2YS6d//eRISErhyJYOnnnqCli1bMW7cBP755xAzZ36Nm5sbVquV1q0f4umne+fksPAHoqMvEhn5K7/8sgxVheeff4HOnbvStGkY7ds/Qp8+OesBPPlkOA880BSTyUSdOnXzxJo1aw7An3/uYuPG9bz66hv4+/vzxRefcOrUSQB69+5O5cpVGTNmLAAuLs6YTP99o2c2m9HpchbQLO5aAtdMnjyZqKgovvvuu3ynYhRCiLLEZDFhVRV+ObaWdf9uwXwLXxyWRT9+N4d9u/aQnprKp+Pex8vbm8+/m8E7o0fR64U+hDUK4/zJMyXSN8+Z851txrjr++bDh/9h0aIfbTPGNWnyAP375ywBULHifbz88iu88spLADRv3pInnuhUZOzEiePMnj2LqVNnAPD222P5+OMPmT59KjqdE++//xHe3t537TyXNhrVkVYNEsWmmI1kHt9FysYFKIasEtuvb4suGOt34bUv8p+R5VaM7F4RL5//PujeqXVUjHoDh/8tmWn+XF2cKBfgjrNOe1vPr5w6dZoqVSrnmfXregkJCSiKQmho3tvE1zOZTJw9e5b777/fts1isaDT6WwX1YsXL+Lt7U1gYCAmk4lz587nemDe19cPf3+/fGMWi55mzcKIjo6mb9++uR6m79y5M927d8+T01dffcWhQ4eYPXs27u4lP/++EEKUFqqqYrKa2Xx+FxFHVt+VtU9ux+BqffAKDLL9fMfWUTHqOXHpQonsy0XnRDnPIJx1zmVmcefY2Cjq1atr7zTsRu6olDKK2YCizyIhchqG6OMlvv/sf/dRoW3vEt/v9UqqmLiTjCYL0XFX8PZ0IcjPHY2Gm7ooXr58mfT0DCwWCxcuXECn0/F///d/XLgQRUhIOdzd3cnKyiY6OhpFyXk2JOaUOkAAACAASURBVC0tnYoVK+Lt7UVKSgpms4WQkHIAnD17FrPZjNWqcPLkKby9vahYsSLZ2dnExyeg0eR0kt7e3gQE5Ex16OLiQrlywZw7dw4ALy8v/Pz8CoxdXc+q2GsJ/Pvvv3z//fdUrVrVNp3nfffdxzfffHM7p14IIUodo8VIujGTqX/OscsaKCWhpIqJO8lktRCTEYeXqydBHgFo4I48cC8ch9xRKSVUVUG1mMnY/zup25ai3sEH8Sq/OYdpq86x7WDJTJl44x2V0kaj0RDg44avtyslPDmYQynr39oIIcTNunYXZd2ZrSw78huWUjTM68Y7KqWNTqslyCMAD2f3e/ruSlnvm+WOSimgmAxYMpJIWPEVpoQ7/01N1sk9PNEirMQKldJOVVWS0/Vk6k2UD/REq9Pe0wWLEEKIohksRtINGXz151zOp8qsTXebVVGIz0zC3cmVcl5BaDXae7pgKaukUHFwitlI6s5fSN+zEu7SbCFZJ/dQq2fbu3Ks0sRoshIVd4Vgf3e8PMrO+FghhBD/UVQFs9XC76e3EHFsNdarw3eFfegtRi6mxxDsEYCni6f0zfcYKVQclGq1oJj0xC37FGPM3Z2By3DpJE5OTlSv4Mu52HtrlfrbpaoqCSnZZOmdCQnwQKPRINdEIYQoG0xWM1eMmUze8S0X0i7ZOx1xlapCQlYK3hYjQR4BUqzcQ2QuUQekmAyYEi5w6fvhd71IyUnAiv7CEbq2q373j11KZOnNXIy/gtmilNjc7kIIIRyXwWLkbPIFRq37SIoUB3XFmEVsRhwWxVriC0UK+5BCxcEoJgNXDm8mZsFYrFlpdssj8/hOwmr62u34pYHFohAdf4Ws
bHOJr2ovhBDCcRgsRrae382HW6eRbdbbOx1RCKPVTHR6LHqLUb5IvAfI0C8HoaoKqtlE4upvyTqxy97poD97iODOHni7O3NFX7IzjNWtFYhnIeuK3Kosg4Hjp4ue+vh/C2exb88OEhPj+HTKPCpVrpanzZF/9hHxv3lEXzxPxye70efFQfnuy2QyMnLkGC6cP41WA+vXb7HFYmNj6dWrK9Wr17BtmznzO3x9/Th9+hQff/whqqpgsVho2LARo0a9jcvVOYIjI5ezaNECVBUefLAVI0eOsS2mWFjsekajkWnTprBv315cXV2pX78B7747vsjzI4QQ4j9Gi4n5B5ex9fxue6dyR9WrWB13N9cS36/eYORYzLki2y2e+wP7du0mMT6Bz779mkpVq+Rpc/jgISIWLCL6QhQdu4TT95X++e7LaDIxfPgbXPj3LBqNpth9c0JCAh988B6nTp2kUqVKLFiwONd+S6Jvnj59Klu2bOLy5VgWL46gRo2aRZ6bskwKFQegWq1YDVe4vOgDzEnR9k4HuGGV+g0lO/zM082NZ5bl/8H/dkT0nlWsdk2bteHxTj35aMKwAtsEl6vAK6+/xV97tmE2mwpsp9XqCO/yDN7evnz20WgUVc01NtbLy5ufflqa53WVK1dh3rwfcXZ2RlEUxo4dw4oVv9K793PExsYwb95sFi5cgq+vLyNGDGHdurV06tS50NiNZs78GldXF37+ORKNRkNysuOvXyOEEI7CqlgxWIx8uv0bTicX/UG7tHN3c2XiqNUlvt8JU/L2T/kJe7AFT3TtzMTRYwtsUy40lFeGvcFfO3cX0TdrCe/RDS8fbz4b90Gx+2YPD3deffV1srKymDPnu1yxkuqb27ZtT+/ez/HaawOKc1rKPBn6ZWeq1YI1M5WYeWMcpki5Juv4Tto1KmfvNEpc7fsbEBhU+PsKLV+RKtVqotXpCm2n0+mo37ApHp5eqKjEJGQVaxiYm5sbzs7OQM7q8kajEe3VOY83b95Iu3bt8ff3R6vV0rVrDzZu3FBk7HrZ2dn8/vtqXn11sG0xrMDAwCLzEkIIkXMXJS4zkbfWfVwmihRHULteXQKDgwttE1qhPFVrVEenK/zjq06no36TRnh6eeb0zRlxWFWFokaCeXl507jxA7i5ueeJlUTfDNC4cRNCQkILT0TYSKFiR6rFjDk9gUvzR2O94njfdmef3kdIgIe90yhVjCYLlxIysV4tVrKyMunXry8vvdSHRYt+zPVwX2JiIi+88CxPPPEIHh4edOvWE4C4uDhCQ/9bIDMkJJT4+LgiY9eLibmEr68f8+Z9T79+fRk0aCB//33ojrxnIYS4l5gsJqLSLvHOhk9J1qfaOx1RAkxWMzHpcSiqFVAL7ZsLUhJ9s7h5MvTLThSzCXNKLJcXTUAxZNk7nXyZU2JRjdm0a1KRbYdk8cfiMpmtXIrPpJx/ICtXriMwMICUlBRGjx6Ot7cPXbt2ByA4OJifflqKXq/ngw/eY+vWzTz22OMlkoPVaiUm5hK1atVh6NARHD16hNGjh/PLLyvx9PQqkWMIIcS9xmQxEZUew8St0zBZS/b5TGFfZsXCpYzLBPkHsHLl7wQGBubbNwvHIndU7EAxGzHFnyf2x3EOW6Rck3VqL4+3qGzvNEods8VKfJoRXz8/VBUCAgJ4/PFOHD78d5627u7uPPpoR9atWwtAaGgocXGXbfH4+DjbbeLCYtcLDQ1Fp3OiY8cnAKhfvwG+vn5cvCirJwshRH5MFhPRGZeZuEWKlHuVRVFINKbg4+eLqqqF9s03Kom+Wdw8KVTuMsVsxBB9ksuL3kc1G+ydTpGyTu6hdgUZ/nUrkpOTOR+ThsWqoNfr2bFjG7Vq1QZyhmaZTDkPAprNZrZv30rNmjkzfzz8cAe2bdtKamoqiqKwcuVyOnR4rMjY9fz8/GnaNIy//toDwMWLUaSmpnLffZXuxlsXQohSxWQ1cykjjg+2TMVoLfghbVH6paSk
EJVyCYtiRa/PztU3F6Yk+mZx8zSqrIhz1ygWE6a4c8T+9D4oFnunUzxaHVVHLWTEjD23vEr9yO4V8fL5b+ymvacnXjh/Bvv27iA9LQVvb1+8vH2YPPUHvvjkHXr27k/1GrU5deIIM6d9hF6fDaqKu4cnAweNpmHjZmzasIrUlGSefjZnWsTx7wwiJTmRjPQ0/PwDaNi4OQMHvcW+vdv5ddkCtFotGlTatHmIwYOHotPp+P33NSxa9CMajQZFUWjS5AGGDh2B29XzsmLFLyxatBCA5s1b8tZbb6O7+mB/QbETJ44ze/Yspk6dAeQUQx9//CEZGenodE68/vobtGrVusDzEhsbRb16dW/9L0AIIUohs9VMTEYcEzZPwWAx2judu2ZwtT54BQbZfrb39MQ/fjeHfbv2kJ6airevD17e3nz+3Qw+nzCRp5/vQ/VaNTl17DgzPpuCPjsbyOmbXx0+hIZNm7BxzTrSUlJ4+oU+AIwf9hYpScmkp6fj5+9Po6ZNGDh8CPt27eaXRUty+mZVQ5vW//XNVquVbt3CMZtNZGZm4u8fQJcu3Rg48HWgZPrmKVM+Z+vWzaSkJOPr64evry9LlvxS4Hkp632zFCp3iWK1YE1P5NK80aim0rVYVEivd9ibEcLUJbf2MPaNhUpZpNNpqRzijU6nKbqxnZT1i6EQouwxWc3EXUlg/KYv0Vscf5RDSbqxUCmLnLQ67vMpjy6f9U4cRVnvmx33b+YeoqoKqjGb2EUTSl2RApB1fJesUn+brFaFmMRMWcFeCCEchEWxkpSVwvjNZa9IETksipXLV+JlBXsHJoXKXaCajcQumoD1Soq9U7kl2WcP4eXlgZebTBJ3O0xmK3Ep2XJBFEIIB2AwG/hw61T0peB5UXHnGK1mEjKTpG92UFKo3GGK2UhcxGeYEx1rMceboRgyMSVF81TbGvZOpdTL1ptJzTDInRUhhLAjo8XEpO0zSNXf2rOX4t6SZdaTqk+XYsUBSaFyBylmI0lrv8MQddTeqdy2rGO7aNfw3lul3h5SM4xkGcxyQRRCCDswWox8t28RZ1Oi7J2KcCBphgyyTDLqwdFIoXKHKCYDGX9vJPPodnunUiKy/91HaKBMU1xS4lOyMZkV5HoohBB3j8FiZNO5Xey6uM/eqQgHlJiVjMlqKtZK9eLuKLJQmTx5Mo888gi1a9fm9OnT+bbZuXMnPXr0oH79+kyePLnAfZlMJgYMGECLFi1o0aJFrtilS5eoW7cuXbt2tf1KTU0FYOPGjfTo0YPOnTsTHh7O/Pnzc732m2++4dFHH+XRRx/lm2++KXbseqNGjaJNmzbUrl2brKzbW4RRVaxY0hNJ2bjwtvbjSMzJMahGPQ81rmjvVO4NKsQmZmJVVClWhBDiLrBYLcRkxLHw71/tnYpwUCoQdyURq6rYOxVxVZFPR3fo0IEXX3yRvn37FtimUqVKTJo0iXXr1tkWscuPVqtlwIAB+Pv7069fvzxxb29vVq5cmWd7cHAws2bNIiQkhCtXrtCjRw8aNmxIWFgY+/btY926daxevRqAXr160bx5c5o1a1Zo7EZPP/00Y8eOpVWrVkWdkiKpFjNxP39WetZKKaasU3t4smVTdvwdc1v7aVgzEFePkl9HxZht4PCZotdR+d/CWezbs4PExDg+nTKPSpWr5Wlz5J99RPxvHtEXz9PxyW70eXFQvvsym018NXk858+dAuC7+ZF52qiqymcfjSbqwhlbXFEUFs7/lmNHDuDq4kS5cuUYN+4DgoODAYiMXM6iRQtQVXjwwVaMHDkG7dXpEwuLXW/MmJHExsag1Wpxd3dn1Ki3i7WolRBC3IsMViOTd3yLIh9C81WvYlXc3dxLfL96g55jMReKbLd47g/s27WbxPgEPvv2aypVrZKnzeGDh4hYsIjoC1F07BJO31f657svs9nMVxM/4dy/ZwD4fulPtlhifDwjBwzivqqVbdvGfjIRbx8fAKyqQnRKDGOHjsTV
1ZUFCxbb2s2fP4c1a34DIDz8KV5+eWCxYvn5+OMPWL16FZs378TDQ0asFKTIQiUsLKzInVSpkvOPaePGjYUWKk5OTrRq1YpLly7dRIrQqFEj25+9vb2pUaMGMTExhIWFsXbtWrp162ZbKK9bt26sXbuWZs2aFRq70YMPPnhTORVEMRlI+n02ltS4EtmfI8k6uYda3R+67f24erixq2vPEsgot9Yri/ctWdNmbXi8U08+mjCswDbB5Srwyutv8deebZjNhRXfOsK7PIO3ty+ffvRWvm3+WLeCoKAQoi6csW07uP9Pzp45wSdfziXI35OF87/hhx/mMmbMu8TGxjBv3mwWLlyCr68vI0YMYd26tXTq1LnQ2I0mTPgQLy9vALZv38rHH3/IwoX/K9Y5EkKIe4nRYuKLnd+TZsiwdyoOy93NnZnv5v/B/3YM+fSHYrULe7AFT3TtzMTRYwtsUy40lFeGvcFfO3cX0TdrCe/RDS8fbz4d936euIeXJ5/OnFbg6xfOm8//1anDxfPnbdsOHTrApk1/sHhxBAADBrxIkyYP0KRJ00Jj+dmxYxsajeOuq+ZIHOoZlaysLHr06EGPHj2YO3duvmMEz549y99//03Lli0BuHz5MhUqVLDFy5cvz+XLl4uM3QmK2UT22UNkHt12x45hT4boEzi7OFOtgo+9U7ktte9vQGBQ4RMDhJavSJVqNdFeXXG2IDqdjvoNm+Lh6ZVvPO7yJXbv2sJT3Z/LtV2j0WAxmzGbTCSnZ5OVlUlwcE5OmzdvpF279vj7+6PVaunatQcbN24oMnaja0UKQGZmJlqtXBSFEGWP0WJie9ReTiT+a+9URCFq16tL4NVRBQUJrVCeqjWqo9MV/vFVp9NRv0kjPL08bzqPk0ePER97mebtH0RVVa59Et24cQOdOnXGzc0NNzc3OnXqbOt/C4vdKD09jXnzZjNs2Mibzq0scpiFMcqVK8e2bdsIDAwkOTmZQYMG4evrS69evWxtEhISGDx4MO+//z4hISF2zDYvVVVRDJkkri74OZhST7GiP3+Erm1rMG3pra1SX5YoisLcWV/Sb8AwdLrc/9WaNH2QE8f+5o2BPXF1daPifZUZPeZdAOLi4ggNLW9rGxISSnx8XJGx/EyaNJG//tqDqqpMmzazJN+eEEKUCkaLkZ/kuRRxHX22nvfeHIWKyoNtHyK8Zzc0Gg0Gg4GfZs9j1IRxxMXGYlYsqKqKRqMhLi6OBx74b5RRSEgohw4dBCg0dqMvvviMgQNfz/VloiiYw9xRcXFxITAwEIDAwECeeuopDh787y85OTmZ/v3788orr/Dkk0/atpcvX57Y2Fjbz5cvX6Z8+fJFxkqaajER/8vkUrny/M3IOr5TVqkvprW/RVCnbiOqVKuZJ3bh/L/EXLrIjO8jmDnnF+6rXJ0vvviixKdFHDduAitXrmXQoDeYMaPg29xCCHEvMliMzNz7IwaL0d6pCAfhFxDAjIVz+Xj6FMZMnMBfu3azdf1GAJbMW8BjnTsREJTzeVRVVdJKcH2VjRs34OzsTOvWtz+MvqxwmEIlOTkZs9kMgF6vZ/PmzdSpUweA1NRU+vfvT9++fXPdYQF44okniIyMxGAwYDAYiIyMtBUyhcVKkmI2knl4K8bYM0U3LuWyzx3C29sDT1mlvkgnjx9mx9b1DB/8HBPHv0lWZibDBz9HdnYWO7aso16DJnh4eqHVamnd9jEOHdqPxaoSEhJKXNx/QxTj4+MICQkFIDS04FhhnnyyMwcP7ic9Pa3k36gQQjggs9XMP3HH+TvumL1TEQ7E2dkZXz8/AHz9/Gj9cFtOHz8BwKljJ1jxv2UM6zeQmZOnEH0hitde7o/Zai6gb84Z3ZN/35x35M/BgwfYv38f3bqF061bOAB9+jzN+fPn7tj7Le2KLFQ+/vhj2rZtS1xcHP379yc8POfEDhw4kCNHjgCwf/9+2rZtyw8//MDSpUtp27YtO3bsAGDJkiV8/fXX
tv317NmTZ599loyMDNq2bcu4ceMAOHDgAN27d6dLly707NmT+++/n+effx6A2bNnc+HCBZYtW2abuvjXX3Nu47Zo0YKOHTsSHh5OeHg4HTt2pHnz5kXGNm3aZDs2wJAhQ2jbti2QU+AMGDCg2CdRtZhI3vxT0Q3vAYo+E1PSJZ56qLq9U3F4b737CV9/t5Rp3y5hwkfT8fTyYtq3S/Dw8CQ4pDzHjhzEYsmZGe6fg3u5r1I14pKyaP9wB7Zt20pqaiqKorBy5XI6dHgMgIcLiV0vOzs715CwHTu24ePjg4+P3A0TQpQNZsXC7P0ygYjILT0tzdb3Gg1GDu7ZR5XqObN/fvbt13y9YA5fL5jDkLdHUalqFT779mvis5J4pMOjrF272vbl99q1q+nQoSMAjzzyWIGx640Z8y6//baOyMg1REauAeB///uFatXkM1VBNKqsanNbFJOBxNXfkHXiT3unctf4tuyGvm5nBn1ZvMUsR3aviJfPf0PumjWseMdm/dp3uOipkxfOn8G+vTtIT0vB29sXL28fJk/9gS8+eYeevftTvUZtTp04wsxpH6HXZ4Oq4u7hycBBo2nYuBmbNqwiNSWZp5/NmR1l/DuDSElOJCM9DT//ABo2bs7AQblnAEtMiGP8O6/bpic2mUwsmDuNM6ePo9XqCAwqx4DXRhIQGEygrxtbNq5m8eKcdXiaN2/JW2+9je7qg/0rVvzCokV5YydOHGf27FlMnTqD5ORkxowZgcFgQKvV4uPjw9ChI6hT5/4Cz0tsbBT16tW9+RMvhBAOxmA2MufAEnZE7bV3Kg5rcLU+eAUG2X4Oq3H/HZv1a//ZE0W2+/G7OezbtYf01FS8fX3w8vbm8+9m8PmEiTz9fB+q16rJqWPHmfHZFPTZ2UBO3/zq8CE0bNqEjWvWkZaSwtMv9AFg/LC3SElKJj09HT9/fxo1bcLA4UPYt2s3vyxaglarxWKx0KR5GM/2eyHP5DnHDx/hf3MX8PH0KQAEuPvy86IlrPs9p8B48slwBg583dZ+zpzv+D2f2Pbt29ixYxvjxk3I855btnygyOmJy3rfLIXKbVAVBWPcWWJ/eMfeqdxVzoEVqfDy53R7d32x2t9YqNh7HRWHp4EqoT44O93dkZll/WIohLg3WBQrp5PO8cGWr+ydikO7sVCx9zoqjk6r0VDZryI6jfTNd5M8aHAbVKuZpDWz7J3GXWdOjkE15axSfyuLP94TxcSdpEJiqp7QQA+ZUlgIIW6SoliZta9sDMcuSfdCMXEnKapKSnYagR7+aGUNlLvGYR6mL21Ui5msE7sxJUTZOxW7yDq5hydaVC66obgl2QYzZousniyEEDfDolj5K+Yf4jMT7Z2KuAddMWaiqNI3301SqNwiVVVI2bzQ3mnYTdbJPdSuWPCYSnH7EtP0JT5dsRBC3MsU1crSI6vsnYa4R6lAUnaK9M13kRQqt0C1mLnyzyasWen2TsVuDNEncXZxpmp5WbDoTjEYLRiMVuR6KIQQRbMoFvZEHyIhK8neqYh7WJZJj1kx2zuNMkMKlVugqippf66wdxr2pVjQXzhKt3Y17J3JPS0pXY+KVCpCCFEURVFYJndTxF2QlJUqd1XuEilUbpKqWMk6tQfrlRR7p2J3Wcd3EvZ/fvZO455mMlkxmqz2TkMIIRyaxWrhz+gDJGZL3yzuPIPFiNFilBEPd4EUKjdJtVpJ2xFh7zQcQvbZg3h7eeAhq9TfUcnpBhRFroZCCFEQRVVYdvQ3e6chypAUfZqMeLgL5BPmTVAVBUPUEcwpl+2dikNQ9JmYk2N4qk11lm08XezXNaxVDlc35xLPx2gwc/h0QpHt/rdwFvv27CAxMY5Pp8yjUuVqedoc+WcfEf+bR/TF83R8sht9XhyU777MZhNfTR7P+XOnAGwLOl5PVVU++2g0URfO5IpHnT/Dwh9mcCUjA4C+L71OoyYtANiycTW/RS4FVBo1bs67747F4+o5i4xczqJF
C1BVePDBVowcOQattuDvHObO/Z65c79n8eIIatSoWeT5EUKI0sRstbDr4n6Ss1PtnUqp1uC+Gri6upT4fo1GE0cunS2y3eK5P7Bv124S4xP47NuvqVS1Sp42hw8eImLBIqIvRNGxSzh9X8l/gUqz2cxXEz/h3L9nAPh+ad7pqlVV5dNx7xN17nyu+IWz51j43dz/+uZX+tO4WVMANq/bwOqfl6OqKo3CmjLu7fG4ObsC0jffKVKo3ATVaiZl21J7p+FQMo/toH2j8JsqVFzdnPn3y50lnsv/vdWmWO2aNmvD45168tGEYQW2CS5XgVdef4u/9mzDbDYV2E6r1RHe5Rm8vX359KO38m3zx7oVBAWFEHXhjG2bwaBn2pfv88awcdSsVRer1Up2diYACfGXWf7zQiZ9Phsvbx+++OQdVqxcyXO9ehIXF8u8ebNZuHAJvr6+jBgxhHXr1tKpU+d8j33y5AmOHj1CaGj5fONCCFHaqSj8evx3e6dR6rm6uti1bw57sAVPdO3MxNFjC2xTLjSUV4a9wV87dxfRN2sJ79ENLx9vPh33fr5tNvy2hqBywUSdO2/bZjAYmDZpMm+MGcn/1amd0zdnZQGQEBfP8sXL+GTmV3h5e/P5hIms+G05vbs/S9xl6ZvvFBn6dRNM8RcwxZ2zdxoOJfv0PsoHlq5pimvf34DAoHKFtgktX5Eq1Wqi1ekKbafT6ajfsCkenl75xuMuX2L3ri081f25XNt379xE7Tr1qVmrrm0/3t6+APy1ZxthzVrj4+uHVqvl4Q7hbN+6CVWFzZs30q5de/z9/dFqtXTt2oONGzfke2yTycSXX37GmDEFX/SFEKK0i06/LDN93QNq16tLYHBwoW1CK5Snao3q6HSFf3zV6XTUb9IITy/PfONxMbHs3raTLr165tr+59bt1K57P/9Xp7ZtP94+PgD8tfNPwh5sgY+vb07f/ERHtm/dggaN9M13kNxRKSarMZu0PSvtnYbDMSfHoJoNtG5Unl3/yJC46ymKwtxZX9JvwDB0utz/1WIuRaHTOfHFJ++QmpJMteq16PPi63h6eZOclEBgcIitbWBQOVKSE8nIMhIXF5frG5iQkFDi4+PyPf7s2bN44olOVKhQ4c68QSGEsDO92cDa05vtnYYoRRRFYc7X39Bv8KvonHJ/GRlzMRqdk47PJ0wkNSWFajVr0HdAfzy9vUhOTCSoXDAawMPFnfur1mJN8nJUxSJ98x0kd1SKSaPRkn3mgL3TcEhZp/bQqWXesaRl3drfIqhTtxFVquUde6ooCseOHuSVQaP5+PPvcXN3Z/HCWYXuLyOr4NvcNzpy5B9OnjxOz57P3HTeQghRWmg1WvZeOmTvNEQpsubXSO5vUI+qNarniSmKwrG/DzNw+BAmTf8Kd3d3Fs/9AQCNRoOHsxtV/O6jnEcATmYTqsWEOSmm2MeWvvnmSaFSDKqikHVyD1gt9k7FIeWsUp//7dWy7OTxw+zYup7hg59j4vg3ycrMZPjg58jOziIwqBx16zfB3z8QrVZLqzYdOHfmJJBzByU5Md62n+SkBAICgzFbFIKDQ4iL++/OVXx8HCEhoXmOfejQQS5cOE/37p3p1i2cxMQEhg9/g717d9/5Ny6EEHeBVVHYe+kQJqssvieK7+TR42zfuJlh/Qby4VtjycrMYli/gWRnZxMUHEy9Rg3xDwjI6Zvbt+Xc6TOU9wqmZqWapCUmo6QnYo6/QOyFM5QLCgTFQkhwoPTNd4gUKsWgmo1kHPrD3mk4LMPFEzi7OFM5VFapv95b737C198tZdq3S5jw0XQ8vbyY9u0SPDw8afFge87+ewK9PhuAw3/vo3KVnMUzm7dsy/59u8hIT0NRFLZsWkOLVu0BeKBZG7Zt20pqaiqKorBy5XI6dHgsz7FffLE/q1dvIDJyDZGRawgOLse0ad/QosWDd+39CyHEnWS2mlh/Zpu90xClzOgP32P6j3P5esEc3v/yEzy9PPl6
wRw8PDxo8VBrzpw6jVGvx9fNm6hjZ7i/9v24Wq20aViH7ds2kxIfg6IorP5jEw+3agVA22ZNRKqLNQAAIABJREFUpW++Q+QZlWJQLSaMl07aOw3HpVjQRx2le7safL3s7yKbGw3mYs8CcjOMhuJ9q7Zw/gz27d1BeloKn018Cy9vHyZP/YEvPnmHnr37U71GbU6dOMLMaR/lFBKqyu5dWxg4aDQNGzdj04ZVpKYk8/SzOdMijn9nECnJiWRlZjL0tWdo2Lg5AwflPwPYNUHBIXTu9hwfjhuCRqMluFwoA14bCUC5kAp0e/oFPhj3BgANGobR5qFHAfD0DaZ//1d45f/Zu+/wqMq0j+Pfc6am954QIEASulJCDUhTkCaor4INMbgiiuu6ttVFXVlF13URXbFgW13FgoJiYQUFUemi9JoAIQmEFELK9Hn/YMmKlIRkkjMzuT/X5XXhPGfO+U0gM3Ofp91yIwC9e/fhsstGAbBjx3ZefvlFnn123oX/8IQQwsdU2WvYU5Jb94GiXqxWW9N8NlvrN2z5zfmvsP77NRwvK+OJP806ubLW/Hk89efHuPK6SbTt0I5d27Yz78lnqKmuBtz8uHI10+6aQdceF/H10i8pLy3lyusnAfDwzHsoPVZCVWUVM66fSrceF5Fz14zzZkhKSGTy5Bv4yx8fQlUUEmJi+MOtN+MoP0JibDQ3XHUlt933JwB6du/K8EEDAYiPCJPP5iaiuN2yr+b5uB12ytd9Stk372gdxasFdx6EKftGrnv82zPa7r4iieBQWYLPUxJjgptkk82CggN06tTR4+cVQghPszpsfLRtKZ/sPPvKSqJu09tMIjgqWusYmlMUCDYEEh4Qhl7V47ZZcFYcw+2o/7xQAENEAqrZ86ugtvTPZhn6VQe320Xlz7KiSF2q920iJER2qW8Ox6usslO9EKJFUxWFlXlrtY4hfJhB1REVGE7r8BSiAyNQayqxF+XiKC244CIFwGmpxO1yNUHSlk0KlTo4T5TKTvT14Ko5UbtLvWha1TUOULROIYQQ2imuKqXMclzrGMLHKECgwUxiSCwpYYmE6kw4SguxH8nDeaIUaPhNQJe1GkWRD2dPk0LlPNwuJ1W71mkdw2dUblvN4G5xZza4ARlh6DFut5tqi2dXoJMRoEIIX+FwOViTv0nrGP6hhbz361SVCHMoqeFJxAVFY7TbsB3Jw34sH7etxjMXcTlxNaAn5nzks1kKlfNy2SzU5NY9OVycVL1nPQnRZ47PtNhdgHSHelJ1jd2jw79sNisGg8Fj5xNCiKZic9rZVLBV6xg+z+qy+X2hYtYbiQ+OJjUsiXBjEK6KEuxHcnEcLwa357+XuGoqPVpcyGezFCrnpRqMWA7Jal/1dfLOhIV+XU6fOL9p7wmslgq/f0NsTjU2p0fO43a7sVotlJcfIy4u1iPnFEKIpqRTVPaW5mkdw+f9fHwn1upqv/tsVhWFUFMwrcISSQiJxexyYy8+hL34IK6aE016bZel6oKOLyoqYteu3WzduhWLxVL7+K8/m48dK2bChAl07tyZOXPmnPNcNpuNqVOnkpWVRVZW1mlt+fn5dOzYkXHjxtX+V1ZWBsDXX3/NhAkTGD16NJdffjmvvfbaac994YUXGDZsGMOGDeOFF16od9uv/eEPf2DAgAGkp6dTVXVhPyOZ+XwetuJDDZpQ1ZJV7VrLqH7d+WHL/+b1rN9TSWKUidRYCzJ803P0zgCPjIc1GAwkJMQTFhbmgVRCCNG0dh3bj6sJ7oa3NJvKtxNvjiHFGo/iBxMf9aqOAIMZo96EraqcI7ZCXFYPDeu6kBw2B6i6eh1rs1kJDNRz7Fg5R47kn9Z7cuqzOSDAzOzZs/nyyy+x2c79nVRVVaZOnUpERAQ33XTTGe0hISEsXrz4jMdjYmJ48cUXiYuL48SJE0yYMIGuXbvSs2dP1q9fz5dffslnn30GwFVXXUXv3r3p1avXedt+68orr+TBBx+k33/3
nbkQUqicg9vpoGr3eq1j+JzqnWvIGN//tMecLlj0Q4lGifzXX27tR/cOMVrHEEKIZmNz2NhYsEXrGH7BhYvPir7ROkaj6FQdvZO6MS5jBEmh8dgL9lL+9VvYCvdqlil61G2EdB96QTcShwwZwvz58+nQocMZbaduIn799dfnLVT0ej39+vUjPz//gvJ269at9s8hISGkpaVx+PBhevbsyeeff8748eMxm80AjB8/ns8//5xevXqdt+23+vZt+IaWMvTrHNx2GzW5P2sdw+fUHNx+cpf6ONmlvqlt2nUEm90zQ8CEEMIXON0uthfv1jqG0FhkQDjXdhnHq+OeYlr3q4nc/QuHn7mJo//6s6ZFCoDl4HZcNkvdBzazqqoqJkyYwIQJE3j11VfPOpdm3759bN68mT59+gBQWFhIYmJibXtCQgKFhYV1tnmS9Kici16PtWCf1il8j8tBzYFtjB+cxnP12KVeNNzWfSU4nC6Mhvp1MQshhK9TFIWD5QVaxxAaUFDoHJfO2IzhZMa0w36sgIqP/0HN3o1aRzuNtWC31y1THBsby8qVK4mKiqKkpITbbruNsLAwrrrqqtpjjh49yvTp05k1axZxcWdZwVUjUqicg634ELg8uwRsS1G1fTW9B96odQy/t//wcQx66RQVQrQcB8rzcTdirwvhe4IMgVzSti+jOwzDrDPg2LWegnf/hqvaO/fRsZcWet1eZ0ajkaioKACioqIYM2YMmzZtqi1USkpKmDJlCrfccgsjR46sfV5CQgIFBf+7MVBYWEhCQkKdbZ4k33LOwVa0X+sIPqt6r+xS3xycLjd5hRVaxxBCiGaTW3ZI6wiimbSNaMXMPjfz0rgnmJjaH+c371H0txs49uk8ry1STrEd9a5/pyUlJdjtdgBqampYsWIFGRkZAJSVlTFlyhQmT558Wg8LwGWXXcYnn3yCxWLBYrHwySef1BYy52vzJPkmeRYuuxVrUa7WMXzWyV3qCxjdvy3vL5exxE1p486jtEkMQ6+Tew5CCP9mcVg5UH5Y6xiiCRl0Bvq36snY9OFEB0bgOLCdolfuwVHiW3/vlsO7MCWmoSjn/2x+/PHHWbZsGceOHWPKlCmEh4ezdOlScnJyuPPOO+nSpQsbNmzg7rvvprLy5B4tS5cuZfbs2QwcOJB3332Xo0ePMnPmTAAmTpzIkSNHqKioIDs7m4EDBzJ79mw2btzIc889h6qqOBwOBg8ezHXXXQfAyy+/TF5eHgsXLmThwoUA3HDDDUycOJGsrCxGjBjB5ZdfDpycMN+7d2+A87YtX76cFStWMHv2bABmzJjBL7/8ApwscDp06MCCBQvq9bNU3LLt5RmcliqOfDAHy8FtWkfxWWH9JlCdPpLpz3yndRS/1qdzAnddcxFBAS17QyghhP+rslXz1OoX2VGs7WRp4XnxwTGMbH8Jl7Tpi9NajXXjfyj/YZHPDsEP7jyI6MtuQTWduQm2uDDSo3IWit6A7Zh3ddv5murd60nsN1HrGH6vuLxa6whCCNEsDDoD+RVFWscQHqIqKj0SuzA2fThtIlKwF+VS8u/Hsebv0Dpao9mO5mkdwW9IoXI2Lheuahn73xj2Y4dwO07uUv/rzR+FZxWX1ciEeiFEi+ByuzhhrdQ6hmikMHMoI9IGcln7waguF45tP3D4jVm4bc2/OWNTsR07jGIwax3DL0ihchb28iNaR/ALVTvXclnf7lKoNKGKKhuq6mXLiwghRBMorpKNg31ZZkx7xqQPo2t8JvbSIiqXvkTVjh+0jtU0XA7cdguKDP1qNK8oVE6cOEFubi5VVVWnPd6YnSwbw3ZEJtJ7QvWuNWSO7V/3gaJRjldaiQoL0DqGEEI0KZlI73sC9Gay22QxNn04wfoAHPs2U/DR7bhO+H/R6aw+IXNUPEDzQmXRokU89thjBAYGYjb/r5tMURSWL1/e7HncLif2Y/nNfl1/VHNgO3Gmk7vUHzxyQus4fqvkuEUKFSGEX7M7
HeSWHdQ6hqin1PAkLu8wlL4pPXBUlVPz/WIK1y/VOlazclaWYYjwno0TfZXmhcqzzz7L3LlzGTRokNZRAHA77DhrZAysR/x3l/pxg9KY977sUt9UCkuq6NAqQusYQgg/0rt3b9atW3fG43379uXHH39s9jx2l52S6vJmv66oP72qJyv5IsZlDCchOBZ7/m6KX78f29EDWkfThKPimNYR/ILmhYrT6WTAgAFax/gflwuXFCoeU7X9e3oPvEHrGH7t8NFKXC63zFURQnjMqc3hfvuYy+XSIA243W5qHP4z2dqfxARGcln7wQxNG4DbZsH28zccXrnQZ5cW9hR7+VHcbjeKIp/NjaF5oZKTk8OLL77I9OnTUVXtVy9y48ZpkULFU6r3bSJ61O8IMOmpsbbsN62mcrSsBqvdSYBJ819nIYSPmzRpEoqiYLPZmDx58mltRUVFXHTRRRolgxq7VbNri9MpKHSL78i4jOG0j2qDrfgQxz94mprcn7WO5jWcJ0pwO2woBpPWUXya5t9s3njjDY4dO8arr75KeHj4aW3ffvutJpmkR8VzXNUV2EsLGD2gDR8s36N1HL9UXF6NyyX7tgohGu+qq67C7XazZcsWrrzyytrHFUUhKiqKPn36aJJLQZEeFS8QYgxiaNv+jOowFKOi4ty1joK3/4pLbvCewVFZhtvpBNmPuVE0L1SefvpprSOcRlF18gvnYZXbVjO4+0gpVJpIZfWZQzSEEKIhrrjiCgC6detGWlqaxmn+R1UU6VHRUPuoNoxJH8bFCZ2xVxRT9Z83KN2yUutYXs1ZWQbITcTG0rxQ6d27t9YRTqPo9DKZ3sOq96wnqd8ErWP4LYfThQyBFUJ40o4dJ3cHT0tLY//+/fz5z39GURQeeeQRTQoYVdVRY5celeZk0hkZkNqbsRnDCTeF4MjbStHLd+Eok73m6sNlswDy4dxYmhQqL774IrfddhsAc+fOPedxM2fObK5I/6OofrU7qjewFx/C7bDRt3M8P24t0jqO33E4tJncKoTwX//4xz947733AHjqqafo0qULgYGBPProo7z11lvNnkev6qhxSI9Kc0gKiWdUhyEMbN0bZ/UJLOu/pPDHxYB81lwQl1NuInqAJoVKUVHRWf/sFVxOrRP4papdaxnZr5sUKk3A6ZJVRYQQnlVaWkp0dDRWq5WNGzfy3HPPodfrNZuj4na7cbTwVaSakk5R6ZnUjfEZI0gOS8BesJ9j/5qFrUCGbDeU2+VCelQaT5NC5dFHH6398xNPPKFFhHPzgpXH/I1qDgaXky5twrn3+p5ax/E7JoMOnSxNLITwoMjISA4cOMDu3bvp0qULRqORmpoa3G5txtzbpUhpEkmh8czIuomUsAQUFJSaSqx7N4PbSXifsVrH82mqwSTfKT1A8zkqp1RWVlJWVnbaYykpKc2eQ1F1zX5N/6IS0O4igtKz0CW1h5BITEYzbpcL1a2jR0QQ9uMWrUP6FUXVec8vshDCL0yfPp0JEyag0+l49tlnAfjhhx/IyMjQOJnwlLv63ELP+E5YqipxWxzojSbU4Aj0GVm1x7hsTuzlNdjKLbgsUixeCLeiA7d8OjeW4tbq9sh/7d27l3vuuYedO3eiKMppm+OcmszXnNwuF7lzrpEhYPVkiEkhuGN/jKldcEfGYTIHU2WrYW9pLluO7GJvaR7lNceZO/xh3A4n1iILhZ80/9+rP9OHmEidcjGqUYpsIYTn1NScnK8ZEBAAQElJCS6Xi5iYmGbP4nQ5ufaDGc1+XX+UHtWW+/vPwKhT2f3zWr79+I3anrLA4DCS0zKJS2lLVHwyYeGxmM1B6MxGcLqxV1iwlVRjPVKFvawGW1kN9nILbpkreQZDmJlWN14kn82NpHmp9+ijj5KVlcVbb73F0KFDWbFiBc8884x2m0q5XSg6A24pVM6gmoMIyuxHYNrFuONT0QeGoSgqB8rz2Xp0F7tyl7O3NI8T1tNXTXt88B8o37CRgx98
RPe/PY0h3Iy9XHpVPEZ6loUQTcBisbBy5UqKi4vJycnB4XBoNvRLURR0qg6nfDY3yp8H30VGZDvcbgebVy9j7X8WndZeXXmc3T+vYffPa854bkRcEsltM4hNak1kmyTCuiZiNAegMxpwWuzYyy3YjlVjLf5fEeOosLbcFXpVNPt98SeaFyo7d+7ktddew2Aw4Ha7CQkJ4d5772X06NGMGzeu2fO43S4UnR53i9+aQiUgrdt/h3B1gNBITMYAiitLWHtsL9t3fsGekjyKThzFfZ53oQ5RbUkLT2HTa3OwlZTitNQQ2a8VRz7f3Yyvxb8pOlXeDIUQHrVu3TruuOMOOnfuzKZNm8jJyeHAgQO89tprzJ8/v9nzOFxOAvRmKm1VzX5tf9AvpSe3Xnw9RoMel9PBD1+8z5Y1Ky7oHGVHDlN25PAZj6t6PfGt0khM7UB0YiqRnRIIDYrHYDKh6lQcVTZspTXYiquwlfy3F6a0GmeNfw8lU3T1v4s4Z84cvvrqKw4fPsynn35Khw4dzjhm9erV/P3vf2f37t1cf/313HfffWc9l81m47bbbmPr1q0ArF279oxj3G43U6ZMYceOHae179ixg8cff7x2KsZ9993HoEGDAHj//fd55ZVXcLvdZGdn89BDD6H+dw7O+drO5vnnn2fevHnnfK2/pnmhYjKZcDgcGAwGIiIiKCgoIDQ0lPLycm0Cud0oOs1/LM1OH5VESMcBGFt3xh2VgMkcTLXdwp6SPLYeXseeX/LILT+E3XlhFdzve95IweJPsZWUArDzqb/RadbDlKwy4qi0NcVLaXF0Zn3LvWMlhGgSf/3rX/nHP/5B37596dWrF3ByE8hffvlFkzwul4sAvUkKlQtk1Bn567D7SAiM5WjhCWITAvn6w1fZt2WDx67hcjgo2L+Lgv27zmgzBwaR1DaT+FZpRMWnEN4mjoiAFPQmI7j/O5SstOZ/Q8lKa7CX1+C2+/5QMtVU/8/moUOHcsMNNzB58uRzHpOSksLs2bP58ssvsdnO/f1JVVWmTp1KREQEN91001mPefvtt0lMTDxtikV1dTUzZszgmWeeoXv37jgcDk6cOAHAoUOHeP755/nkk08IDw8nJyeHJUuWMH78+PO2nc22bdvYvHkzSUlJ9fjJeEGh0qNHD7744gsmTJjApZdeSk5ODkajUbMlEHG7wM8LFdUYSFDHfgSkXQTxbdAFhqKqOg4eP8y2I7vYdeBb9pTkUWE90ajrDGnbn1BdALs//F/XcsUvW3DZ7ERkJVO8fH9jX4oAdAEGrSMIIfzM4cOH6du3L0DtvFGDwYDTqc3QKxcuzAazJtf2VWPTh3NV5hiKiyrZvqeQzK4xfPbmXA7vb755opbqKvZt3cC+rWcWRmFRcSS3zSAupS2RKUmEdkrAZA5AZ9LjsjqxH7dgPVZ9siemrAZ7ac3JxXh85Macaqr/3JSePeteETU1NRWAr7/++ryFil6vp1+/fuTn55+1PS8vj6VLl/Lkk0+yfPny2sc/++wzevToQffu3WvPExERAcBXX33FsGHDiIyMBOCqq65i0aJFjB8//rxtv2Wz2Xjsscd45plnuOGGG+p8zeAFhcqvN3y8++67ad++PVVVVeesxJqc241qCsCfRsEGtOlGUHofdCkdICQKkymAY1WlrC/ey/ZdX7K3JI+CE0fOO4SrIW7MHEfuq2/gsp6+Sde++fNpf8cdlPxwEJefd/02B9WsR5HliYUQHpSWlsZ3333HwIEDax/74Ycf6hym0VTcbjcBeilU6iPMFMoTQ+8nxBDCV59spXVaFOmdI/n45ScpLjigdbxax0uOcLzkCNvWrzztcVVViU1uS2KbDsQkphKZnkhUcCwGkxnVoOKosmMvq8F6tApbSXXtfBhnlXeN2deZ9F43h9TlcvHQQw8xa9Ys9PrTS4C9e/ei1+vJycnh6NGjdOrUifvuu4+wsDAKCwtJTEysPTYxMZHCwkKA87b91ty5cxk7
dizJycn1zqxpoeJ0OrnppptYsGABRqMRVVU1mZfya263G31oDPbiQ5rmaCh9RALBnQZgbt0FV1QCxoAQLHYLe0vz2HJ4A3tLc8ktO4TtAodwXagpF12Nu/w4xd+uPKPt2MrvaHfbbUT0SKJktfe8afoqnVkPUqgIITzogQceYNq0aQwePBiLxcKf//xnVqxYwT//+U/NMgUYTJpd21fcfNHVDEkdSN7eY7z5wQaumNSVuAQzH7zwGMdLjmodr15cLhdFB/dSdHDvGW1Gs5mkNhnEp7YnOj6F8NQ4IgKSTw4lUxQcp4aSHa3CVnqyiLGXWXDZmv/2sy7QgOJl+6gsWLCAXr16kZmZeUaPi8vlYs2aNbz33ntER0fzxBNP8OSTT3psv8OffvqJrVu3cs8991zQ8zQtVHQ6Hfn5+bhc3jMWUVH16MOitY5RP0YzwZn9CEzrAQmt0QWGodPpOFhecHIVroOr2Fuax3FLRbPGCjQEMrxVH3Y8OhvOMcn74LvvkXrddZSuzcdt96f+q+anCzCg6r3rzVAI4ds2bNjAkiVLWLJkCRMnTiQhIYEPP/yQL774gq5duzZ7HgUFs/SonFOrsCT+nP179G4Di975id07jjDtrv6YTTYWzptF9YnjWkf0CJvFQu6OzeTu2HxGW2hEDElt008OJUtMJiwjDmNAIHqjAZfdhf34b1YlOzWUzNU0Y8n0QUav+2zesGEDu3btYvHixTgcDioqKhgyZAhLliwhISGBrKwsYmNjARgzZgwPPvggAAkJCRQUFNSep6CggISEhDrbfm39+vXs27ePoUOHAlBUVMTUqVN54oknGDBgwDkzaz706/bbb+eRRx7hjjvuID4+vnYsLHDeFQOaimIwog+Lbfbr1oc5tQtBGVnoUzIgNAqTKZBjVWVsOLaX7buXsbckj8MnijRfAeruPlOp2LaDiu3nHgdbsHgJrSZPIqx7POXrz1xFRNSfPkzuMgohPOuFF15g6tSp5OTknPb4iy++yJQpU5o9j16nJyIgrNmv6wv+OOB3dIvpxLafCvhq8Xbcbjd33J+NraaE959/Gpu1RuuIzaKirJiKjcXs2Lj69AZVJSahFUlt04lNbE1k+0QiglMwmsyoRj3Oahu2csvJuTDHqk8WMGU1jV7wRx/qfZ/NL730Uu2f8/PzmThxIitWnFz9beTIkeTk5FBZWUlwcDCrVq0iPT0dgEsvvZTJkyczY8YMwsPD+eCDDxg9enSdbb82bdo0pk2bVvv/Q4YMYf78+d6/6tdDDz0EwOLFi2sfO7XpoxYbPiqKgjG6fisRNCV9eBzBnQZgat0VV3QCJnMIVqeNfaUH2FL4E3u25pJbdhCr07tWzkoJS6RTVBqbH/19nccWffEF8ZeN4vimAtxOH5kd54WMUYFaRxBC+Ikff/wR+N8wkF/f+MrPzycoKEiTXEadgaTQeE2u7a26xWfy+97TsFvcvPPSWg7llREcYmTa3f0pKdrP5/96DqdD5oHiclF8OI/iw3lnNOmNRhJbp5OQ2o6YhFaEd00gLDARg8kEqoLjhBXbf+fD2EuqT07qL6vBZa17JIghIqDeER9//HGWLVvGsWPHmDJlCuHh4SxdupScnBzuvPNOunTpwoYNG7j77ruprKzE7XazdOlSZs+ezcCBA3n33Xc5evQoM2fOBGDixIkcOXKEiooKsrOzGThwILNnzz5vhsTERHJycrjmmmtQFIXk5GT+8pe/ACdXHJs+fTpXX301AP3792fs2LF1tm3ZsoXnnnuOV155pd4/i9/SfGf6V199lZEjR572mNvtZtmyZdx8882aZLIeyePwq39ovgvqTQRn9iWwfY+Tq3AFhaHTGcg/XsDWI7vZVbKPPSW5lDfzEK6GeG74w7i+/4ncV1+r1/F93l9Iyao8jv9c1MTJ/FfbGVnozLLylxCi8YYMGQKcnCD76+EbiqIQExNDTk5O7dCN5rbj6B5mffN3Ta7tTXToeGzIH2gd
1oq1q/azctkenE4XUTFB3HxnFgd2bmb5Rwtwe9Gwel8UFBpBcloGcSlpRMUlExYWgykgCL3JgMvhOrm08qlVyUotJ+fDlNfU3nhte3uWrMrpAZoXKhdffDGbNm064/HevXuzbt06DRKBs7qCA882Xde2uVUngjL6nBzCFRaNyRRIaXU5O4/tY9vR3ewtzSO/olDzIVwXqm9KD+7oNokNt9yKs7q6Xs9p//uZRPbqQ+78dT6z5KA3UfQqaXf28boJe0II33bvvffy1FNPaR3jNCXVZdz26YNax9DUJW36MaXr/3G81MKid36iuOjkNgLJrSO4LqcHW9eu4IcvP9A4pf+Lik85OZQsqQ2RMYmEhERiMAegM+px1tixl9dgTgiVFTk9QLOhX6e6l51Op1d1LwOo5iBQ1JN7qjSSPiyGoE4DMbfpgjs6CaM5BJvTzr6yA2w98jO7t+eyv+wgVoe17pN5uWldruLAv96pd5ECsOfZufR5vx8hGTGc2FHchOn8kyHcjMvuQmeSQkUI4TneVqQAhJlDURUVlwc+m31NgD6AJ4fdT1RAJCs+38X61bm1a9VkdI7nisldWLPsI37+/j/aBm0hSooOUVJ05uqwer2R+NR2pHXuSUZMPwxG75un4ms0K1T+9Kc/ASc3fzm1qgD8r3v51NwVLbgddvThsTjKLnA4kt5IUEYfgtr3RElogxIUjkFnIP94IT8W72bn5h/ZU5pLWY1/rL7xa//XeSy6agtFXy274Oce37aVqAEZUqg0gCG8/mNghRDCl9mdduKDYyg4cUTrKM3qqk6jGdt+BIWHjvPPd7+lotxS29ajXytGjElnxaI32PPzGg1TCgCHw0b+vu2YAgJJ767RxuV+RrNC5dQqA97YvYzbhSm+bZ2Fiik5k6DMvhhaZeIOi8JsCqKs+jg/HdvH9r0r2FNycgiXv9/9MeqMjG07iN1znoEGjInd8ehf6PP+QoLSIqnaV9oECf2XMSrA65Y/FEKIpuDGTau8IG2ZAAAgAElEQVSwpBZTqMQERvH4JX8kQBfIZx9sYdtPBae1D76sA30HpfL52/M4tGebRinF2UTGJaGX3hSP0HzVL68rUgDFYMaU2I6qHT/UPqYLjT65kWKbrrijkzEEhOBwOdhfeoAtR39hz45c9pUewOIHQ7gu1MzeN1G9P5fyn85c17y+qg/mETUwVQqVCxSYGo6ik0JFCOH/zDoTqeFJrMk/c16rv5ne6wb6Jfdiz/ajfL5oDTW/2XV9zNVd6Ngtho9fmcPR/FyNUopzSWqbockWG/5I80LFGymqSlCH3uhCIlES0lCDwjHojRyuKGTN0d3s/GUte0vyKKkp0zqq5uKCY7g4riM/P3Vvo87zy0MP0/dfbxOQEkbNIf8bGtdUTHHBWkcQQohmoaoq7aPaaB2jSbWLSOXBAXeCU+X91zewf/exM4659paeJKUE8sELf6H8mKyY6Y1ik1prHcFvSKFyDrrwOLZUFrB1/zfsKcnj0PECvx/C1RB/zJpK8arvqDmU37gTWWxYjhYRNTCV/H//4plwfk4fapIVRYQQLUq7qNZaR2gyf8q+g45R6fy07iDLl+7EbvvNXh0q3HJnX4KCXCycN4uqinJtgorzCouKPW3zctE4Uqicg9Vp46Ptn3PoeEHdB7dQ3eI7kRgUy8a3HvHI+bY88Cd6vbYAU3ww1qJKj5zTn5kTQnC7ZE1nIUTLoaCQEpboV5/NvZO6c3vPKdRU2nnznz9QcJZRBXq9yu/u6Y/TXs77zz+NtaZKg6SiPuJbpckeNh4khco5qIpKRnSaX70ZetqMiyaR//6HOCo8sxGlo+IEtuPlRA1IpeBDmRhYF3NSKKpRp3UMIYRoNjpFpWtcpl98NutUHU8MvZ+k4ARWL9/L9yv24jrLzafAICO3/qEf5cUHWfrWXBx2mwZpRX0ltc3EaJYVOT1FZvqcg0lvpFt8ptYxvNbYjOGYHVCw5DOPnnfbQ7MISArF
GCW/5HUJah0u3ctCiBbFqDfSK6mb1jEabWT7Ibw+5lmMVcG89Mwqvvt6z1mLlIioQG6/bwCFedv49PW/S5HiA5LapGsdwa9Ij8p5ZMS01zqCV1JVlavbX8q+uS/gdjg8em5LYSGO6moi+6VS9OlOj57bn+gCDehDzVrHEEKIZtcuqjU6RcXpg/NGw4zB/HXYA4QZQ1m2ZDub1h6Ec4zgTWwVxvW39mLHhlWsXvpu8wYVDRIQHEpwWITWMfyK9Kich1FnoFVYktYxvM70XtdjPVxI6Zq1TXL+HbP/SlBaBPpQWYP8XAJbRzRozxohhPB1DpeDdj64+tcN3a/khVFPcDzfwfNPfsOmNecuUtpnxnLD73qz8dtPpUjxIakduuB0Ous+UNSbFCrnoVd09GvVU+sYXiXCHEb/+O7s++f8JrtG5a7dOK1WIvukNNk1fF1IejSqUTpEhRAtj1Fn8Kmh2Ykhcbwy5mmGpgzkk3c38+9X11FZce491y7qncKVN3Rj1ZJ/sWnl582YVDRWh259MJpktIMnSaFyHnqdnkGts7SO4VXu6ZND6fqNVO1v2g2m9sx9jpDMGHSBhia9jk9SIKBVmNYphBBCE3pV7zPzVH7f9xaeGvYQedvKeG72CnZuOf++JwOHteOyKzL48t//ZOem75sppfAEVacnsU0HrWP4HbklW4cgYyDJoQnkVxRqHUVz6VFtaRuWzKbXnmjya5WtXY/LYSeiVxLHVuY1+fV8iTkxVJYlFkK0aIkh8QQaAqi212gd5aw6xaZzT5/f4bS6efeVdRzYX1rnc0ZO6Ey3HnEsXvA3ig7ubYaUwpOS2nTA6XSgNxi1juJXpEelDjoZ/lXr9z1v4vDixdhKy5rlenlvvElY9wRUkyzB+2shGdGoBvmZCCFaLqfbSf9WvbSOcVaPDfkDD/a/g81r8pn3xDf1KlL+b0oPOneP4sMXH5cixUe169ILg1GGfXmaFCp1MOj0ZMvwL4a2HUCIaubwhx832zWPfLkMt9NJ2MWJzXZNr6cqhHSMlR3phRAtmllvYmT7wVrHOM2g1CzeGjeXKGccC55bzfLPduJ01L3oyc139CEx2cjCeY9QetT394dpiVRVR/uuWaiqfK32NBn6VQ9hphASQ+IoOHFE6yiaUFG5seNYcl9+HZeteddwz1/8CcnjJ1C+/jDuerzh+7vA1uFaRxBCCK8QExRFUmg8hyvOP++jqQXojMwe9iCxgVF8+9Vu1q7cj7seo3NVvcrv7u6H4q5k4bw5WKormz6saBKp6V21juC3pPSrB0VR6JvSQ+sYmply8VW4SsspXrmq2a+d/++F4HYT2jWu2a/tjcK7Jchu9EIIAehUleFpAzXNcEXmSF4e8zdcpQbmP72SNd/Wr0gxB+i584FsrNVH+PDFx6VI8XFd+g2V3eibiBQq9WDUGRjStr/WMTQRbAxkaEof9r4wn3q9+zaBo6u+JapvK2jhw51Uk46AVNmNXggh4OTqX4Na90GnNP9XmciAMF68/AkmdBjJ5x9t5Y0XfqC8tH4T+8MiAphxfzbFh3exeMHT2G3nXqpYeL+AoBASW8tqX01FCpV6CjYG0jXOd9Zt95S7s26hYstWTuzQbpf4/S/MB1UhtGOMZhm8QXB6DMhqX0IIUUtRFLondGrWa07rOZnnLn2co7k1zHviG7ZsPFzv58YnhfK7P/Rj39Y1fPHOC7hkc0Cf16F7X9yyAXOTkUKlngIMZq7uPFrrGM2qdXgymVFt2f/yAq2jUP7TRiL7p0IL7kyIzEqWYV9CCPErgYYALmumSfVtIlqxYMwz9I3rxYdvbeSDNzZSXVn/eZttO0Rz0+1Z/LT6C1Yu/pdmoxSEZ3XrPxyD0aR1DL8lk+kvQGp4Mq3Ckjh4vP53T3zZH3rdTNGXy7AePap1FHY+8RR93l9IcPsoKneXaB2n2QW2DkcNkF9XIYT4rcyY9oSaQqiwnmiya9w/8Ha6RGfw84Z8/vPpDuy2
C+sJ6dIjidFXduS7pe+yfd3KJkopmlurDl0wBwZrHcOvSY/KBdCrOiZ0HKl1jGbRv1VPokyhHHp3odZRalXu20PUgNZax9BEZL9W6IxSqAghxBncbsakD22SU/dI7MKb4/5BG1Nb3npxDZ9/tPWCi5R+l6Qx+sqOLFv4khQpfiZr2HiMJtk7pSlJoXIBdKqOnkldiDCHaR2lyeV0uZK8t97GWeM9u/5uffgRdMGGFrdErzE6EFNMkNYxhBDCKxn1Ri5rP5hgo+feJ3XoeGL4/dyddSvrVuXxwpxvOHyw/ILPM2JcJtnD2/Dp638nd/tPHssntBeb1JrIuCStY/g9KVQumMLlTXTnxltM6joOpdLCkWVfax3ldA4HlsLDRA1I1TpJs4rsk4Kia8GTc4QQog4KCuMyRnjkXJe2y+b1cc8SWB3Oy8+uYuVXe3A5L3w+ycTru9O9ZzwfzZ9NQd5uj2QT3qPX0HHo9AatY/g9KVQukFFnYHjaQEx6/5w4ZdQZubz1IPa/+BJ44SoWm+99EGNkIObEEK2jNAt9qImgdlEostutEEKc08lelUGENKJXJdgYxLyRf+H6zlex4vOdvPKP7yg5WtWgc904PYvUNkEsfP4RSoryG5xJeKfQiBhS2nWUneibgfyEG0BRFIb56b4qd2XdTNXefZRv/lnrKGdnsWAtO9ZielWiB7VBtk0RQoi6KSiMy7y0Qc+d1HU8L456gqoiNy/M+ZYN3x+ABizKpapw2z0DCAmxs3DeLE6UHWtQHuHd+l52ldxAbCbyU24As97EVZ1HE2QI1DqKRyUEx3BRbAb757+sdZTz+uXBhzEnhGD083kbxuhAgtpGoOjk11QIIepi1Bu5tN2F9arEB8fw8uinuCx1CEve+4W3X1rLieOWhl3frOeOBwbhsB3jg3/+hZqqpluFTGgnOiGF1hnd0OlkgZvmIN+AGkiv6pnUdZzWMTzqnqwcir9dRU2+dy+/7DhWgqPyBFH9W2kdpUnFDGkrRYoQQlwARVEYn3lZvY69s88U/jb8zxzaeZzn/rqCHb8UNvi6IWFm7rh/IKVH9vHxK3OwWxtW7AjvN2js9ej0UqQ0F/kW1EBGnYHs1n1ICUvUOopHXBTfmYSgaA78622to9TL1kceIzA1HEO4fy4LaE4OxZwQgqLKuC8hhKgvo87AiHbZhJjOvbdFRnQ7Xh/7d7qFd+W9Bev5+N+bsdTYG3zNmIQQbvtjf/J2buTzfz2Hy+lo8LmEd0tp34nohBSZm9KM5CfdCAZVz609J2sdwyNuv+haDr33Po4TlVpHqZeavAM4LTVE9k3ROkqTiB2ahmqQXeiFEOJCKYrCDd0nnrVt1uC7eHjgXWxZV8i8J1aQt7dxGwinpkUxdUYWW378mm8WvY5bdpv3X4rCoLHXY5B9U5qVFCqNoKoqrcKT6JXUTesojTI+41JMdij87HOto1yQnU8/Q3B6NLpgo9ZRPCq0cyyGMHkjFEKIhjDqDPRJvpgOUW1rH+uX0pO3xs0lTkni9ee/Z9mS7TjsjVvZslP3BCbdcjE/fvUBa/+zqLGxhZfr1DObwBD/30fP20ih0khmvYlpPSdh0PnmWtp6Vc+V7Yez/6VXcDt8q7u64udfcNnsRGYlax3FY9QAPTGXtEU1Sm+KEEI0lElv5I4+NxGgN/PMpQ9ze88b+X75Pl58eiVFhysaff7eA9sw9v868/WHr7JlzQoPJBbeLCgknP6XXyO70GtAChUPMOlNjPfQRlPNbXqv67EeLqB07TqtozTIvpdeJrRzHKrZPya2xQ5JA5lAL4QQjRZmDuXl0U+jlAcw/2+r+H7FPtyuxg/NGnp5BkNGpvHZm3PZt2WDB5IKbzdk4s0ygV4j8o3IA8x6E2MzRhATGKl1lAsSGRBO34Ru7HvhJa2jNNixb1fidjiJ6JmkdZRGC0wNJ6hdJKq+7l/LOXPmMGTIENLT09m9++w7Hq9evZoJEybQuXNn5syZc85z
2Ww2pk6dSlZWFllZWWc9xu12c9NNN53RvmPHDiZPnsyoUaMYNWoUK1eurG17//33GT58OMOGDeOxxx7D9asNRM/XdjbPP//8eV+rEEL8lllvQqcqfPLuZspKqj1yzvHXdqNH3wQ+fvlJDu/f4ZFzCu/WtlMPEtu0l+WINSKFiofoVR1/6H8rig/tzvfHPjmUrF1HVW6u1lEa5eDC9wi/OBHFhyefKwaVuFEd6j2BfujQobzzzjskJZ27QEtJSWH27NlMnTr1vOdSVZWpU6fyxhtvnPOYt99+m8TE01e4q66uZsaMGfzxj3/k888/Z8mSJXTt2hWAQ4cO8fzzz7Nw4UKWLVvGgQMHWLJkSZ1tZ7Nt2zY2b9583tcqhBBno+oUJl5/MXjgo/m6W3uTlh7KBy88RnHBgcafUHg9ozmAIROmYDDKkC+tSKHiITpVR2JoHBM6jtQ6Sr1kxrSndWgSea+9oXWURiv4eDFut4uw7vFaR2mwuBHtL2heSs+ePUlISDjvMampqWRmZqKvo7tar9fTr18/QkJCztqel5fH0qVLmTZt2mmPf/bZZ/To0YPu3bvXniciIgKAr776imHDhhEZGYmqqlx11VV8/vnndbb9ls1m47HHHuORRx4572sQQoizUVWVqJggsga0acRJ4Na7+xMZ6WbhvFkcLznquYDCq10y/kb0Bv9asMfXSKHiQWa9ifEZl9I2wvs3IpzZ43oOf/wJ9rJyraN4xJGvviQyKwVF5zs9WqeEZMacHPLlhT1CLpeLhx56iFmzZp1R8Ozduxe9Xk9OTg7jxo3jwQcf5Pjx4wAUFhae1gOTmJhIYWFhnW2/NXfuXMaOHUtysv8smCCEaF5Gk54ho9KJTTj7zZjzPteo5477s8FdzvvPP0r1ieNNkFB4o4yL+9M6szt6g28uluQvpFDxMJPeyL0DbyNA773dhMPTsglRTBxe9InWUTwmd8EboEBI5zito1wQQ7iZ2OHtvLJIAViwYAG9evUiMzPzjDaXy8WaNWuYPXs2H3/8MUFBQTz55JMeu/ZPP/3E1q1bmTRpksfOKYRomfQGHddO7Y3hAnqug0OMzHhgIBUlB1j00l+xWWuaMKHwJhExCQwadz0Go0nrKC1egwuV5prMm5+fT8eOHRk3blztf2VlZcDJibxXXHEF48aN4/LLL+fhhx/GZrPVPtcTk3mtViuzZs1ixIgRjBkzhocffrjOn02wMYg7+95c53FaUFG5PnM0ua++hutXPyt/ULLmB6L6tfLIWOTmoOgUEid0RKnH5HmtbNiwgY8//pghQ4YwadIkKioqGDJkCJWVlSQkJJCVlUVsbCyqqjJmzBi2bNkCQEJCAgUFBbXnKSgoqB2qdr62X1u/fj379u1j6NChDBkyhKKiIqZOncrq1aub+FULIfyNoigEBhkZf233eh0fFRPEbfcOIH/vz3z25j9w+tjy/aLh9AYjo2+6C51eelK8QYO/ITXnZN6QkBAWL15c+9+pcfBt2rRh4cKFLF68mE8//ZTy8nLee+89wHOTeZ9++mlMJhNfffUVn376KTNnzqzrR4NRZ6BTbAdGdRhS57HNberF/4ezpIziVf73ZW/P3+ei6FVC0mO0jlIv0Ze0RR9iQlG9t7J66aWX+Pbbb1mxYgX//ve/CQ0NZcWKFQQHBzNy5Eh++eUXKisrAVi1ahXp6ekAXHrppXz99deUlpbicrn44IMPGDlyZJ1tvzZt2jRWr17NihUrWLFiBfHx8SxYsIABAwY03w9ACOE3DEYdaekx9Orf+rzHJbeOIOeuvuxY/y1ff/AK7jpWJRT+5ZIrbiQwJBxV9d6biC1Jg/8WmnMy77mYzWaMxpOTnBwOBxaLpfYflicm81ZVVfHJJ58wc+bM2tW8oqOj65dNb+LaLuNoH9WICXweFmoM5pKU3ux7YT64G7+WvDeq2L6NqIGpWseoU2iXOEI7xTZ4yNfjjz9OdnY2RUVFTJkyhcsvvxyAnJyc
2l6NDRs2kJ2dzeuvv857771HdnY23333HQDvvvsuc+fOrT3fxIkTueaaa6ioqCA7O5s//elPdWZITEwkJyeHa665hjFjxrBt2zYeeOAB4ORNiunTp3P11VczYsQIkpOTGTt2bJ1tW7ZsIScnp0E/EyHESd4w4uHIkSNcf/319OjRgwkTJpxxXk+MeKjP6/wto0nPsNEZJLeOOGt7Rud4rr+1J2u//pgfvvygXucU/iOz50DaduqBQSbQew2fWBS6qqqq9o1u1KhRTJ06tbZwOHLkCNOmTePgwYMMGjSIq6++GvDMZN5Dhw4RHh7O888/z9q1awkKCmLmzJn07NmzXrlNeiMPZM/ggWVPcKTqWMNevAf9vs9Ujv+yhRM7d2kdpclsf+Qx+ry/kKC2kVTtL9U6zlkFpIQRM6Rto+alPPTQQzz00ENnPP7KK6/U/rlnz56sWrXqrM+/9tprT/v/jz76qM5rJicns3bt2tMeGz9+POPHjz/r8ddccw3XXHPNBbV16dLltNfwaytWyO7PQtTH0KFDueGGG5g8efI5jzk14uHLL788bcj0b50a8RAREcFNN910RvupEQ+/FRgYyMyZM6msrOS55547re3UqIZPPvmE8PBwcnJyWLJkCePHjz9vW0Ne59kYjHqundqL+U+v4kSFpfbxHv1aMWJMOt98/Aa7N6+5oHMK35fYJp3sMddhMEqR4k28vl8rNjaWlStXsmjRIl555RWWLVvGhx9+WNseFxfH4sWL+f7777Hb7fznP//x2LWdTieHDh2iY8eOLFq0iHvuuYc77rijdqhLfQTqzTw29B7CzKEey9UQbSJSyIhsQ+4rCzTN0RyqD+YRle2dvSqGyAASr8j02snzQgjf5w0jHkJCQujZsycBAQFntHlq+fL6vM5zMZr03Hh7X0zmk69/8GXpjBiTzhdvPy9FSgsUFhXH5TfMlCLFC3l9oWI0GomKigIgKiqKMWPGsGnTpjOOCwwMZNSoUXz66aeAZybzJiQkoNfrGT16NADdunUjIiKC3AvYIFFVVUKMQTw65G4CDNqtBHZ3r5sp+uJLrEeLNcvQXH556GEMoWYCUsK0jnIaNUBP8tWdUfRSpAgh/MOpEQ8TJkzg1VdfxV2PYcWeWr68MXQ6ldAwM5Nyshh7TVeyBibx8StzOLhnq8evJbxbQFAIE6bdLyt8eSmvL1RKSkqw2+0A1NTUsGLFCjIyMoCT3cenuqxtNhvLly+nQ4cOgGcm80ZGRpKVlcX3338PQG5uLiUlJaSmXtjder1OT3RgJA8PmolBbf7RdgNa9SbKGMKh91rIeFuLDUvxEaIGeE+vimJQSbqyM7oAg1dPnhdCiPqqa8SDt9MbdMQnhdDl4iQ+fHE2R/PrfxNS+Ae9wcj4W+7FHBQsk+e9VIP/VpprMu/GjRu54oorGDt2LBMnTiQzM5PrrrsOgE2bNjFx4kTGjh3LhAkTCAsLY/r06YDnJvM++uijvPTSS4wZM4a7776bp556itDQCx/GZdQZSAlL5J4Bt9bOr2kut3SZQN6b/8JZ03LWgN9y/4OYYoMwxQVrHQVFr5J8dReMUYEoOnkjFEL4h/qOePgtT4x48BSDQY/L6eCiAZc22TWEdzpVpIRFxaLT+cSU7RZJcdenn1Z4jNVh5cdDP/HPdW82y/Umd72CEeFd2TT9DmhhSyz2eO1lnBUqBR9t0yyDolNIuroLptggmZcihGhWQ4YMYf78+bUjDc5m3rx5VFdXc9999533XPn5+UycOPG0BTVKSkoIDQ3FYDBQU1PD9OnTGTx4MDfeeGPtMWvXrmXOnDksWrSo9rFDhw4xefLk0ybMjx49miuuuOK8bY15nXWx26xsXfst33/+XoPPIXzHqSIlOiEFvazw5dXk9m4zM+lN9Em5iOu6nblco6eZ9WZGtR7IvhdfanFFCsC2hx8hIDkUQ+SZkzmbg6JTSLyysxQpQohm5Q0jHpxOJ9nZ2cycOZPdu3eT
nZ3NvHnzAM+NeDjX62wIg9FE56zBDBp3XYPPIXyDTm9g3NQ/EiVFik+QHhWNWBxWfjy4kZc2vIPL3TRFxH39byOtVGHbw480yfl9Qa9/vYG1wELRZ827JLOiU0iY0JGAxFApUoQQwkfYbVb2bdvI8g/qtzCA8C06vYHxt/yRmMRUKVJ8hPSoaMSsN9GvVQ8ezJ6BUWfw+PkTQuLoHpPO/vkve/zcvmTHX58kqF0k+tDmW81DNelIvqarFClCCOFjDEYTaZ16cNnk21FVef/2J0ZzAFfk3CdFio+RQkVDJr2JjJg0Zg+7lxBjkEfPfW/WLRz95ltqDhfUfbAfq9yxE6fVSmSflGa5nj7ESKvrL8IYI8O9hBDCFxmMJlq178zoG+9Cp/f8jUTR/ELCo7jmjkeJTmwlRYqPkUJFY0adkaSQeJ669E/EBkV75Jw9ErsSFxjFwbf/7ZHz+bq9zz1PSGYMusCm/cAxRgfS6oaL0IUYUfXyqyWEEL7KYDSR0LoDV9/+ZwJDvGtPLnFhYhJT+b87HiE4LBK9FJ4+R75NeQG9Tk+EOYw5Ix6gbUSrRp9verdrOPTuQhwnKj2QzveVrlmLy2EnoldSk10jICWMlEndUM16VFmCWAghfJ7BaCQ8Jp5rZz5OTKL37Msl6i81vSsTbn0Ac2Awqk5GOfgi+UblJVRVJcgYyCND7ubihM4NPs8VmZdhtLsoXPqFB9P5vrw33iSsewKqyfNvVBFZySRO6Ihq1DX7HjlCCCGajk6nJyAomAm3PkC7rr21jiMuwMWDRnHZpOmy47yPk1W/vJDVYeXb3DW8uflDHC5HvZ+nV/W8cfmT7Hv2OUrXbWjChL6pz8J3KdtQSNmaQx45n2rSET8m4+SkeaPcqRFCCH9mt1nZ8uNyfvzqQ1kRzIuZAoK4bNJ04lulSZHiB6RHxQuZ9CYGt+nDM5c9TFJofL2fN6PXDVjyD0uRcg6HFy8holcSigfmj5hig0id0oOA5DApUoQQogUwGE106TOEq26fRUh4lNZxxFnEJrdh8t2zSWjdXooUPyE9Kl7M5XJhd9l5++eP+WrvyvMeGx0YwbwRs9hy34NU5eY1T0Af1Of99yj5/iDHNxU2+Bxh3eOJHtQGRa/KUC8hhGhhXE4nDoedbz5+gz0/r9U6jvivbv1H0GfERAxGWdXLn0ih4gMsdiu7S/Yzd81rnLCefYL8k0PuJXjbIfY8O/es7eKktNtvI2bgIPa/uA5cF/ZPXx9mImF0BsaoQOlFEUKIFs5us3Jg1y8s/+g17FaL1nFarNCIGEZccytRcUkYTGat4wgPk0LFR9idDqxOK8/+8Cpbjuw8ra1TTHse6jedjbfejr28XKOEvqPP+wspXrGfE9uO1vs5YRcnED2wNYpOQVFlxKQQQghw2G3YrBaWf7iAA7t+0TpOi6IoCt36jyBr+BXodHpZ1ctPSaHiY6wOG5sKtvDaT+9z3FIBwEuX/YWKpV+Tv/ADjdP5howH7yO0YzfyXl4PdfzrN4SbiR+djjFSelGEEEKcnd1mpSB3N998/AaVx0u1juP3ImITufTa3xEWESO9KH5OChUfZHc6cLqdfLhtKQ6nk0lpI9h4y+9w2WxaR/MZfT5YyNEv9lC5p+Ss7apRR2T/VoR1jZdeFCGEEHVyOh24nE7WLV/Mz6uX4XI5tY7kd4zmAHoPHUen3peg0+tR5bPZ70mh4sMsDit6m5Mjy74m7/U3tY7jUzo/+TgBcakceG3T6Q0KhHWLJ2pgaxRVQTVIL4oQQoj6s9usVJ84zqpP35HhYB6iqjo697mEPsMnoKo69DJhvsWQQsUPOGtqqCksYv9Lr3Bi5y6t4/gGvZ4+775D0ZKdVOednNcT2CaC2GFp6AIMMsxLCCFEo9isFo6XHOW7z/5NQa58NjdU244Xkz32OkzmABnm1QJJoeIn3G43LqRoGoUAAAtPSURBVKuVih07yV3wBjWHPLOpoT/r/tzf0ZkiKfn+
INHZrTGEmaVAEUII4VF2m5VjhYf47rN/czQ/V+s4vkFRaJPZnT4jJhASHo1RCpQWSwoVP+NyOnE7HFQfOEj+Rx9Tum49uFxax/I6ik5H9OBBtLvtd+BGChQhhBBNxu1y4XDYKT58gA3ffMrBPVu1juSVVJ2O9O596T1sPKaAIClQhBQq/sxRXY3b6aRw6RcUffGVLF0M6AICiLt0OMkTr0A1GNAFBGgdSQghRAtis1qw1lSzceVn7Nz4PQ67LIRjMgfSsVc2PQZfjqrTS4Eiakmh0gI4rVYURaF8888c/ngxFdt3aB2peakqoRkZxFwyiJjsAQDozPImKIQQQjs2qwUF2L7hO7au+5ayowVaR2peikJKWiZd+w4jpX1n3G4XBqNJ61TCy0ih0oK4XS5cViv2igryP/qEku9/wFF59p3ufZ6iEJqZQfSgbGIG9AdVQTWZZEMoIYQQXsXpsONyuag+cZxt675l989r/XovlpCIaDr2zKZz1mB0Oj0GkxlFUbSOJbyUFCotlLOmBkWvp6agkJIf11C2YSOVe/eBL/9zUBRCMtKJGXyqOFGlOBFCCOEzTg0DKztayLb1K8nb+bNfFC2RsYmkdelJeve+BIdFgqKg1xu0jiV8gBQqApfdjstuPzk87JctlPywhvKfNmM/flzraHXSh4YS0r4dEb16ED1gAIpOihMhhBC+z261oKgq1ZUV7N+2kYO7t1KQt9sn5rSYA4OJT21H244X0yazO3qDEUVVpTgRF0wKFXEGR3U1qsGAtbiYkjVrObFjF9X5+ViKjmi6gpguIIDgdmkEt29HWJfOBKeloQsMwGWzoZrNUpwIIYTwSy6XC7vNgl5vpKy4kILcXRQd2sfR/DzKS45oOxpCUYiISSChVTuS23UksU0HAgKDcTjsGI1mFNk9XjSCFCrivFwOBy6rFUVVUY1GbGVl1Bw+TOXe/VQfOkTNoXxqDhfgrKnxzAVVFUNoKIawUAxhYQSmJBPauRMhHTpgCA/DZbWiGo2oBrkrI4QQomVyu1zYbVZQFFRVR1lxIUUH91JWXEhF2TFOlB6joqz45DEeotPrCYuKIyI6nojYRGKSUomMTSIkPAqXywkgq3UJj5NCRTSI2+XCabGA241qMuGyWLCVluGorsZZVYW9shLHiUoclZW4HQ7cLhdupxO30wlu0IeGYIqOwhgZiSE8HENoCPqgIFSjEZfd/t/j3Ch6PTqTrAIihBBC1MVh///27jakqb6PA/h3c9aubKkJlVYEYZZQ9EZdVIac2Xp0rYVBWmGNggwVfBOW9GAumi8MxJ7QwBc90JMup6JShxQDNfVFXVFZgZEpRiW3l3O6bLtfeDl8nPd1X03H+n7ejHN+/3P2PwNx37PfOceGnz8HIYEEMt9Z+Dn4A5a//gNbvxUD/X0YsFpg7etFv+Uv2Pqtfwcd6dDJSIkUUh8ZfGQyzFH4w29eIOb4KSCfMxez5H9A6iPDoG0ADjjg68sWa5oeDCrkdg67HXA4/n4FAAckvr68ywcREdEMcDgczl9BJBIpJBLJtP5PNhqNqKqqwufPn2E2mxEWFjZuTF1dHXJzc9Ha2ooDBw7gxIkTE+7LZrPh2LFj+PPPoYdoNjQ0OGvt7e1Qq9VYsWKFc11RURECAwOdywMDA9DpdJg9ezaKi4ud6y9fvoySkhIAwO7du3H8+PH/qTaRjIwMFBcXo6WlBX5+fi7H0miymZ4Aeb/h/lQJz74QERHNOIlEAh+fmfsKqFKpcPDgQSQmJk46ZunSpTAYDKisrITNNvkNBKRSKfR6PQIDA5GUlDSurlAo8OjRo0m3v3TpEtauXYs3b9441z1//hyVlZUoKysDAMTHxyMqKgqRkZEuaxMRRZEnZv8FXuFERERERNMmIiICwcHBLscsW7YM4eHhkMlcByqZTIb169dDoVD843k0NTWhra0Nu3btGrW+oqICWq0WcrkccrkcWq0WFRUVU9bG6u7uRn5+PjIyMv7x
3GgIgwoREREReSWLxQKdTgedTofCwkIMX/HQ19eHCxcu4Ny5c+O26ezsREhIiHM5ODgYnZ2dU9bGysrKQmpq6v8VomgIg4obGY1GCIKAlStXorW1dcIxdXV10Ol0WL16NYxG46T7stls0Ov1UCqVUCqVE45xOBxISkoaVbfb7cjOzsb27dsRFxcHvV6Prq4uZ/3evXvYvHkzYmNjkZWVBfuI2w+7qo2UnJwMjUYDrVaLhIQEvH792uXnQkRERORuCxYsQE1NDYqLi1FQUIDq6mo8ePAAAJCTk4OEhAQsXLjQLe9dUVEBX19fxMTEuGX/vwsGFTdSqVS4desWFi9ePOmY4R5MvV7vcl/DPZhFRUWTjrl58+aolA8M9Ua+ePECpaWlMJvNCA0NxdWrVwEAnz59Qn5+Pu7evYvq6mp8/PgRpaWlU9bGMhqNKC0thclkwuHDh3Hy5EmXx0JERETkbrNmzUJQUBAAICgoCHFxcWhpaQEANDc348qVKxAEAenp6WhtbUVcXByAoV9JOjo6nPvp7Ox0tqq5qo3U2NiI+vp6CIIAQRAAADt37sT79+/dc7BeikHFjaazB7OtrQ3l5eU4evTouJrNZsPAwADsdjssFgsWLVoEAKiqqkJsbCzmz58PqVSK+Ph4Z5+lq9pYI+fU29vLi8aIiIhoxn379g0/fvwAAFitVoiiiFWrVgEAzGYzRFGEKIrIzc1FWFgYzGYzAGDr1q0wmUzo7+9Hf38/TCYTtm3bNmVtpLNnz6K2ttb5HgBQVlaG0NDQ6Th0r8G7fnkBu92OzMxMnDlzZlzgEQQBjY2N2LhxI+RyOZYvX47Tp08DGN9nGRISMmkP5sjaRE6dOoVnz57B4XCgsLDwVx4eEREReZHs7GxUV1fj69evOHToEAICAlBeXo4jR44gNTUVa9asQVNTE9LT09Hb2wuHw4Hy8nIYDAZER0fjzp07+PLlC9LS0gAAe/bsQVdXF3p6erBp0yZER0fDYDCgubkZeXl5kEqlGBwcRExMDPbv3z/l/JRKJdRqNXbs2AEA0Gq1iIqKmrL25MkTiKIIg8Hgjo/tt8Sg4gVu3LiByMhIhIeHo729fVTt1atX+PDhA2pra+Hn5weDwYCLFy86w8qvMvxHaTKZkJOTg4KCgl+6fyIiIvIOmZmZyMzMHLd+5HeHiIgI1NbWTrj9vn37Ri0/fPhwwnFqtRpqtXrK+SiVylHPUAGAlJQUpKSkTDh+sppKpYJKpZpwm7dv3045DxqPrV9eoKmpCSUlJRAEAQkJCejp6YEgCOjt7UVJSQnWrVsHhUIBqVQKjUbjfBjS2D7Ljo6OSXswR9Zc0Wq1aGhoQHd39y8+SiIiIiL6nTCoeIHr16/j6dOnEEURt2/fxrx58yCKIubOnYslS5agvr7e2aNZU1PjfELrli1b8PjxY3z//h12ux3379939lm6qo1ksVhGtYSJogh/f38EBARMw5ETERERkbdi65cbTVcPpiuJiYl49+4dNBoNZDIZgoODcf78eQBDdxxLTk7G3r17AQAbNmyARqOZsvby5Uvk5eWhoKAAVqsVaWlpsFqtkEql8Pf3x7Vr13hBPRERERH9KxLH8JNviIiIiIiIPARbv4iIiIiIyOMwqBARERERkcdhUCEiIiIiIo/DoEJERERERB6HQYWIiIiIiDwOgwoREREREXkcBhUiIiIiIvI4DCpERERERORxGFSIiIiIiMjjMKgQEREREZHHYVAhIiIiIiKPw6BCREREREQeh0GFiIiIiIg8DoMKERERERF5HAYVIiIiIiLyOAwqRERERETkcf4LSgchDi8WFS8AAAAASUVORK5CYII=\n",
+ "text/plain": [
+ "