
ml.adoc quickstart updated with Database Analytic Functions #137

Merged
merged 1 commit into main from krutik_ml_model
Nov 24, 2023

Conversation

krutik-2-11
Contributor

@Daniel-Itzul @adamtwo I have updated the ml.adoc quickstart with Vantage Database Analytic Functions. Please review.


github-actions bot commented Nov 21, 2023

PR Preview Action v1.4.4
Preview removed because the pull request was closed.
2023-11-24 12:14 UTC

[source, bash, role="content-editable"]
----
scp ~/Downloads/VAL-2.0.0.3-1.x86_64.rpm [email protected]:/tmp/
----

[source, teradata-sql]
Collaborator

Suggested change
[source, teradata-sql]
[source, teradata-sql]

Collaborator

Suggested change
[source, teradata-sql]
[source, teradata-sql, id="analytics-first-query", role="emits-gtm-events"]


== Feature Engineering

On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.
Collaborator

Suggested change
On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.
Feature engineering involves identifying attributes of the data with potential predictive power in relation to the trait we aim to model. It also encompasses transforming those values so that they can be used for predictive modeling; this often involves numerical encoding and scaling.
On looking at the data we see that there are several features that we can take into consideration for predicting the `cc_avg_bal`.

@@ -129,69 +91,183 @@ CREATE TABLE VAL_ADS AS (
GROUP BY T1.cust_id) WITH DATA UNIQUE PRIMARY INDEX (cust_id);
----

We will now build a linear regression model that takes parameters from the dataset and tries to predict the monthly credit card balance.
Let's now see how our data looks. The dataset has both categorical and continuous features or independent variables. In our case, the dependent variable is `cc_avg_bal` which is customer's average credit card balance.
Collaborator

Suggested change
Let's now see how our data looks. The dataset has both categorical and continuous features or independent variables. In our case, the dependent variable is `cc_avg_bal` which is customer's average credit card balance.
Let's now see how our data looks. The dataset has both categorical and numerical attributes that can have predictive power for modeling the customer's average credit card balance `cc_avg_bal`. These predictive attributes, or independent variables, are called features; the modeled dependent variable is called the target.


=== TD_OneHotEncodingFit

As we have some categorical features in our dataset such as `gender`, `marital status` and `state code`. We will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode categories to one-hot numeric vectors.
Collaborator

Suggested change
As we have some categorical features in our dataset such as `gender`, `marital status` and `state code`. We will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode categories to one-hot numeric vectors.
As we have some categorical features in our dataset, such as `gender`, `marital status`, and `state code`, we will leverage the Database Analytics function link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_OneHotEncodingFit[TD_OneHotEncodingFit, window="_blank"] to encode these categories to one-hot numeric vectors.
Categories need to be transformed into numerical values to be used for modeling. There are several strategies for this; however, discussing how to choose the best method goes beyond the scope of this guide.

The procedure creates several output tables. For now, we don't have to analyze what is in the tables. Let's see how we can use the newly created model to perform scoring.
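As a conceptual illustration only (plain Python, not Teradata syntax), one-hot encoding replaces a categorical column with one 0/1 indicator column per category. The rows and column names below are hypothetical:

```python
def one_hot_encode(rows, column, categories):
    """Replace a categorical column with one 0/1 indicator column per category."""
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"cust_id": 1, "gender": "F"}, {"cust_id": 2, "gender": "M"}]
encoded = one_hot_encode(rows, "gender", ["F", "M"])
print(encoded[0])  # {'cust_id': 1, 'gender_F': 1, 'gender_M': 0}
```

The in-database functions follow the same idea, with the "fit" step recording the category list so the identical encoding can be reapplied to new data.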
=== TD_ScaleFit

If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. For the optimization algorithms like gradient descent it is important to normalize the values to the same scale for faster convergence, scale consistency and enhanced model performance. We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
Collaborator

Suggested change
If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. For the optimization algorithms like gradient descent it is important to normalize the values to the same scale for faster convergence, scale consistency and enhanced model performance. We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
If we look at the data, some columns like `tot_income`, `tot_age`, `ck_avg_bal` have values in different ranges. Certain machine learning algorithms, especially those based on clustering, require the data to be on a uniform scale. The method we will use for modeling doesn't have this constraint; however, scaling the data is still recommended, as it prevents certain features from appearing more strongly correlated with the target than they actually are relative to others.
We will leverage link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ScaleFit[TD_ScaleFit, window="_blank"] function to normalize values in different scales.
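For illustration only, here is a minimal Python sketch (not Teradata syntax) of the fit/transform pattern the scaling functions follow: compute the location and scale statistics once, then apply them to the data.

```python
def scale_fit(values):
    """'Fit' step: compute the mean and (population) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

def scale_transform(values, mean, std):
    """'Transform' step: standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]

incomes = [40000.0, 60000.0, 80000.0]
mean, std = scale_fit(incomes)
scaled = scale_transform(incomes, mean, std)
print(scaled)
```

Keeping fit and transform separate means the statistics learned on training data can later be reapplied unchanged to test or production data.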


== Training with Generalized Linear Model

We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a bunch of input columns such as `tot_income`, `ck_avg_bal`,`cc_avg_tran_amt`, one-hot encoded values for marital status, gender and states. `cc_avg_bal` is our dependent or response column which is continous and hence is a regression problem. We use `Family` as `Gaussian` for regression and `Binomial` for classification.
Collaborator

Suggested change
We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a bunch of input columns such as `tot_income`, `ck_avg_bal`,`cc_avg_tran_amt`, one-hot encoded values for marital status, gender and states. `cc_avg_bal` is our dependent or response column which is continous and hence is a regression problem. We use `Family` as `Gaussian` for regression and `Binomial` for classification.
We will now use link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM[TD_GLM, window="_blank"] Database Analytic Function to train on our training dataset. The `TD_GLM` function is a generalized linear model (GLM) that performs regression and classification analysis on data sets. Here we have used a set of input columns such as `tot_income`, `ck_avg_bal`, `cc_avg_tran_amt`, and one-hot encoded values for marital status, gender, and states. `cc_avg_bal` is our dependent or response column. We use `Family` as `Gaussian` for regression and `Binomial` for classification.



The parameter `Tolerance` signifies minimum improvement required in prediction accuracy for model to stop the iterations and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. For example in the example below the model is `CONVERGED` after 58 iterations.
Collaborator

Suggested change
The parameter `Tolerance` signifies minimum improvement required in prediction accuracy for model to stop the iterations and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. For example in the example below the model is `CONVERGED` after 58 iterations.
The parameter `Tolerance` signifies the minimum improvement required in prediction accuracy for the model to stop iterating, and `MaxIterNum` signifies the maximum number of iterations allowed. The model concludes training upon whichever condition is met first. In the example below, the model `CONVERGED` after running 58 iterations.
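The interplay of `Tolerance` and `MaxIterNum` can be illustrated with a toy gradient-descent regression in Python. This is a hedged sketch: the parameter names mirror the function's, but nothing here is Teradata's actual implementation.

```python
def glm_gaussian_fit(xs, ys, lr=0.05, tolerance=1e-6, max_iter_num=1000):
    """Fit y = w*x + b by gradient descent on squared error.
    Stops when the loss improvement drops below `tolerance` (converged)
    or when `max_iter_num` iterations have run, whichever comes first."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    for i in range(1, max_iter_num + 1):
        preds = [w * x + b for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if prev_loss - loss < tolerance:
            return w, b, i, "CONVERGED"
        prev_loss = loss
        grad_w = 2 * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        grad_b = 2 * sum(p - y for p, y in zip(preds, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_iter_num, "MAX ITERATIONS"

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, iterations, status = glm_gaussian_fit(xs, ys)
print(status, iterations)
```

On this data the loop stops on the tolerance condition well before the iteration cap, recovering weights close to the generating values of 2 and 1.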


== Model Evaluation

Finally, we evaluate our model on the scored results. Here we are using link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
Collaborator

Suggested change
Finally, we evaluate our model on the scored results. Here we are using link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
Finally, we evaluate our model on the scored results. Here we are using the link:https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator[TD_RegressionEvaluator, window="_blank"] function. The model can be evaluated based on parameters such as `R2`, `RMSE`, `F_score`.
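As a plain-Python illustration (not the `TD_RegressionEvaluator` syntax), `RMSE` and `R2` for scored results can be computed as follows; the sample values are hypothetical:

```python
def regression_evaluate(actual, predicted):
    """Compute RMSE and R-squared for a set of scored predictions."""
    n = len(actual)
    mean_a = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    rmse = (ss_res / n) ** 0.5          # root mean squared error
    r2 = 1 - ss_res / ss_tot            # fraction of variance explained
    return rmse, r2

actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
rmse, r2 = regression_evaluate(actual, predicted)
print(round(rmse, 1), round(r2, 3))  # 10.0 0.992
```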

----

image::ml_model_evaluated.png[Evaluated GLM, width=100%]

NOTE: The purpose of this how-to is not to describe feature engineering but to demonstrate how we can leverage different Database Analytic Functions in Vantage. The model results might not be optimal and the process to make the best model is beyond the scope of this article.
Collaborator

Suggested change
NOTE: The purpose of this how-to is not to describe feature engineering but to demonstrate how we can leverage different Database Analytic Functions in Vantage. The model results might not be optimal and the process to make the best model is beyond the scope of this article.
NOTE: The purpose of this how-to is to showcase how Teradata's Analytics Functions simplify feature engineering and modeling. The model results might not be optimal; how to improve the model by choosing or refining features is beyond the scope of this guide.

== Summary

In this quick start we have learned how to create ML models in SQL. The method used Vantage Analytics Library (VAL). We were able to build a linear regression model and run predictions using the model. We have done that using SQL without any coding.
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`,`accounts`, `transactions` data from `val` database. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit` and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for train test split. We trained our training dataset with `TD_GLM` model and scored our testing dataset. Finally we evaluated our scored results using `TD_RegressionEvaluator` function.
Collaborator

Suggested change
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`,`accounts`, `transactions` data from `val` database. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit` and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for train test split. We trained our training dataset with `TD_GLM` model and scored our testing dataset. Finally we evaluated our scored results using `TD_RegressionEvaluator` function.
In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database `td_analytics_functions_demo` with `customer`, `accounts`, and `transactions` data. We performed feature engineering by transforming the columns using `TD_OneHotEncodingFit`, `TD_ScaleFit`, and `TD_ColumnTransformer`. We then used `TD_TrainTestSplit` for the train-test split. We trained on our training dataset with the `TD_GLM` model, scored our testing dataset, and finally evaluated the scored results using the `TD_RegressionEvaluator` function.
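For readers who want the split step spelled out, here is a minimal Python sketch of a deterministic train-test split (illustrative only, not `TD_TrainTestSplit` syntax):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows deterministically with a fixed seed, then split by fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split reproducible, so the same training and testing partitions can be recreated on each run.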

@Daniel-Itzul Daniel-Itzul merged commit 91c0990 into main Nov 24, 2023
1 check passed
@Daniel-Itzul Daniel-Itzul deleted the krutik_ml_model branch November 24, 2023 12:14
2 participants