
ml.adoc quickstart updated with Database Analytic Functions #137

Merged
merged 1 commit into main from krutik_ml_model
Nov 24, 2023

Conversation

krutik-2-11
Contributor

@Daniel-Itzul @adamtwo I have updated the ml.adoc quickstart with Vantage Database Analytic Functions. Please review.


github-actions bot commented Nov 21, 2023

PR Preview Action v1.4.4
Preview removed because the pull request was closed.
2023-11-24 12:14 UTC

[source, bash, role="content-editable"]
----
scp ~/Downloads/VAL-2.0.0.3-1.x86_64.rpm [email protected]:/tmp/
----

[source, teradata-sql]
Collaborator

Suggested change
[source, teradata-sql]
[source, teradata-sql]

Collaborator

Suggested change
[source, teradata-sql]
[source, teradata-sql, id="analytics-first-query", role="emits-gtm-events"]


== Feature Engineering

On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.
Collaborator

Suggested change
On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.
Feature engineering involves identifying attributes of the data with potential predictive power in relation to the trait we aim to model. It also encompasses transforming those values so that they can be used for predictive modeling; this often involves numerical encoding and scaling.
On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.

@@ -129,69 +91,183 @@ CREATE TABLE VAL_ADS AS (
GROUP BY T1.cust_id) WITH DATA UNIQUE PRIMARY INDEX (cust_id);
----

We will now build a linear regression model that takes parameters from the dataset and tries to predict the monthly credit card balance.
Let's now see how our data looks. The dataset has both categorical and continuous features or independent variables. In our case, the dependent variable is `cc_avg_bal` which is customer's average credit card balance.
Collaborator

Suggested change
Let's now see how our data looks. The dataset has both categorical and continuous features or independent variables. In our case, the dependent variable is `cc_avg_bal` which is customer's average credit card balance.
Let's now see how our data looks. The dataset has both categorical and numerical attributes that can have predictive power for modeling the customer's average credit card balance `cc_avg_bal`. These predictive attributes, or independent variables, are called features; the modeled dependent variable is called the target.


=== TD_OneHotEncodingFit

As we have some categorical features in our dataset such as `gender`, `marital status` and `state code`. We will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode categories to one-hot numeric vectors.
Collaborator

Suggested change
As we have some categorical features in our dataset such as `gender`, `marital status` and `state code`. We will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode categories to one-hot numeric vectors.
As we have some categorical features in our dataset, such as `gender`, `marital status`, and `state code`, we will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode these categories to one-hot numeric vectors.
Categories need to be transformed into numerical values to be used for modeling. There are several strategies for this; however, discussing how to choose the best method goes beyond the scope of this guide.

The procedure creates several output tables. For now, we don't have to analyze what is in the tables. Let's see how we can use the newly created model to perform scoring.
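As a conceptual illustration only (plain Python, not Teradata syntax), one-hot encoding replaces a categorical column with one 0/1 indicator column per category. The rows and column names below are hypothetical:

```python
def one_hot_encode(rows, column, categories):
    """Replace a categorical column with one 0/1 indicator column per category."""
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"cust_id": 1, "gender": "F"}, {"cust_id": 2, "gender": "M"}]
encoded = one_hot_encode(rows, "gender", ["F", "M"])
print(encoded[0])  # {'cust_id': 1, 'gender_F': 1, 'gender_M': 0}
```

The in-database functions follow the same idea, with the "fit" step recording the category list so the identical encoding can be reapplied to new data.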
=== TD_ScaleFit

If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. For the optimization algorithms like gradient descent it is important to normalize the values to the same scale for faster convergence, scale consistency and enhanced model performance. We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
Collaborator

Suggested change
If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. For the optimization algorithms like gradient descent it is important to normalize the values to the same scale for faster convergence, scale consistency and enhanced model performance. We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. Certain machine learning algorithms, especially those based on clustering, require the data to be on a uniform scale. The method we will use for modeling doesn't have this constraint; however, scaling the data is still recommended, as it prevents certain features from appearing more strongly correlated with the target than they actually are relative to others.
We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
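For illustration only, here is a minimal Python sketch (not Teradata syntax) of the fit/transform pattern the scaling functions follow: compute the location and scale statistics once, then apply them to the data.

```python
def scale_fit(values):
    """'Fit' step: compute the mean and (population) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

def scale_transform(values, mean, std):
    """'Transform' step: standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]

incomes = [40000.0, 60000.0, 80000.0]
mean, std = scale_fit(incomes)
scaled = scale_transform(incomes, mean, std)
print(scaled)
```

Keeping fit and transform separate means the statistics learned on training data can later be reapplied unchanged to test or production data.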


== Training with Generalized Linear Model

We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a bunch of input columns such as `tot_income`, `ck_avg_bal`,`cc_avg_tran_amt`, one-hot encoded values for marital status, gender and states. `cc_avg_bal` is our dependent or response column which is continous and hence is a regression problem. We use `Family` as `Gaussian` for regression and `Binomial` for classification.
Collaborator

Suggested change
We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a bunch of input columns such as `tot_income`, `ck_avg_bal`,`cc_avg_tran_amt`, one-hot encoded values for marital status, gender and states. `cc_avg_bal` is our dependent or response column which is continous and hence is a regression problem. We use `Family` as `Gaussian` for regression and `Binomial` for classification.
We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a set of input columns such as `tot_income`, `ck_avg_bal`, `cc_avg_tran_amt`, and one-hot encoded values for marital status, gender, and states. `cc_avg_bal` is our dependent or response column. We use `Family` as `Gaussian` for regression and `Binomial` for classification.



The parameter `Tolerance` signifies minimum improvement required in prediction accuracy for model to stop the iterations and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. For example in the example below the model is `CONVERGED` after 58 iterations.
Collaborator

Suggested change
The parameter `Tolerance` signifies minimum improvement required in prediction accuracy for model to stop the iterations and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. For example in the example below the model is `CONVERGED` after 58 iterations.
The parameter `Tolerance` signifies the minimum improvement required in prediction accuracy for the model to stop iterating, and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. In the example below, the model `CONVERGED` after running 58 iterations.
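The interplay of `Tolerance` and `MaxIterNum` can be illustrated with a toy gradient-descent regression in Python. This is a hedged sketch: the parameter names mirror the function's, but nothing here is Teradata's actual implementation.

```python
def glm_gaussian_fit(xs, ys, lr=0.05, tolerance=1e-6, max_iter_num=1000):
    """Fit y = w*x + b by gradient descent on squared error.
    Stops when the loss improvement drops below `tolerance` (converged)
    or when `max_iter_num` iterations have run, whichever comes first."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    for i in range(1, max_iter_num + 1):
        preds = [w * x + b for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if prev_loss - loss < tolerance:
            return w, b, i, "CONVERGED"
        prev_loss = loss
        grad_w = 2 * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        grad_b = 2 * sum(p - y for p, y in zip(preds, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_iter_num, "MAX ITERATIONS"

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, iterations, status = glm_gaussian_fit(xs, ys)
print(status, iterations)
```

On this data the loop stops on the tolerance condition well before the iteration cap, recovering weights close to the generating values of 2 and 1.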


== Model Evaluation

Finally, we evaluate our model on the scored results. Here we are using link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
Collaborator

Suggested change
Finally, we evaluate our model on the scored results. Here we are using link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
Finally, we evaluate our model on the scored results. Here we are using the link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
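As a plain-Python illustration (not the `TD_RegressionEvaluator` syntax), `RMSE` and `R2` for scored results can be computed as follows; the sample values are hypothetical:

```python
def regression_evaluate(actual, predicted):
    """Compute RMSE and R-squared for a set of scored predictions."""
    n = len(actual)
    mean_a = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    rmse = (ss_res / n) ** 0.5          # root mean squared error
    r2 = 1 - ss_res / ss_tot            # fraction of variance explained
    return rmse, r2

actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
rmse, r2 = regression_evaluate(actual, predicted)
print(round(rmse, 1), round(r2, 3))  # 10.0 0.992
```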

----

image::ml_model_evaluated.png[Evaluated GLM, width=100%]

NOTE: The purpose of this how-to is not to describe feature engineering but to demonstrate how we can leverage different Database Analytic Functions in Vantage. The model results might not be optimal and the process to make the best model is beyond the scope of this article.
Collaborator

Suggested change
NOTE: The purpose of this how-to is not to describe feature engineering but to demonstrate how we can leverage different Database Analytic Functions in Vantage. The model results might not be optimal and the process to make the best model is beyond the scope of this article.
NOTE: The purpose of this how-to is to showcase how Teradata's Analytics Functions simplify feature engineering and modeling. The model results might not be optimal; how to improve the model by choosing or refining features is beyond the scope of this guide.

== Summary

In this quick start we have learned how to create ML models in SQL. The method used Vantage Analytics Library (VAL). We were able to build a linear regression model and run predictions using the model. We have done that using SQL without any coding.
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`,`accounts`, `transactions` data from `val` database. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit` and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for train test split. We trained our training dataset with `TD_GLM` model and scored our testing dataset. Finally we evaluated our scored results using `TD_RegressionEvaluator` function.
Collaborator

Suggested change
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`,`accounts`, `transactions` data from `val` database. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit` and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for train test split. We trained our training dataset with `TD_GLM` model and scored our testing dataset. Finally we evaluated our scored results using `TD_RegressionEvaluator` function.
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`, `accounts`, and `transactions` data. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit`, and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for the train-test split. We trained on our training dataset with the `TD_GLM` model, scored our testing dataset, and finally evaluated the scored results using the `TD_RegressionEvaluator` function.
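For readers who want the split step spelled out, here is a minimal Python sketch of a deterministic train-test split (illustrative only, not `TD_TrainTestSplit` syntax):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows deterministically with a fixed seed, then split by fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split reproducible, so the same training and testing partitions can be recreated on each run.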

@Daniel-Itzul Daniel-Itzul merged commit 91c0990 into main Nov 24, 2023
1 check passed
@Daniel-Itzul Daniel-Itzul deleted the krutik_ml_model branch November 24, 2023 12:14
2 participants