Sphinx Docstring Update for MachineLearning/Vertica/LinearModel #753

Merged
merged 4 commits into from
Oct 21, 2023
360 changes: 360 additions & 0 deletions verticapy/machine_learning/vertica/linear_model.py
@@ -494,6 +494,366 @@ class LinearRegression(Regressor, LinearModel):
used in training the model. Note that setting
fit_intercept to false does not work well with the
BFGS optimizer.

Examples
---------

The following examples provide a basic understanding of usage. For more
detailed examples, please refer to the :ref:`user_guide.machine_learning` or the "Examples"
add link to the example page of the VerticaPy Website.

Also PEP, try to go back to the next line after the ref.

section on the website.

Load data for machine learning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We import ``verticapy``:

.. code-block:: python

import verticapy as vp

.. hint::

By assigning an alias to ``verticapy``, we mitigate the risk of code
collisions with other libraries. This precaution is necessary
because verticapy uses commonly known function names like "average"
and "median," which can potentially lead to naming conflicts.
comma inside '"' move it out.

The use of an alias ensures that the functions from verticapy are
used as intended without interfering with functions from other
libraries.
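
As a minimal illustration of the naming issue described in the hint, the
sketch below uses the standard library's ``statistics`` module as a stand-in
for any library that exposes common names. The local ``median`` helper is
hypothetical and deliberately does not compute a true median, so the
difference between the two calls is visible.

.. code-block:: python

    import statistics as st  # the alias keeps the library's names namespaced


    def median(values):
        """Hypothetical user helper that happens to share a common name."""
        return sorted(values)[0]  # deliberately NOT a real median


    values = [3, 1, 2]
    local_result = median(values)       # calls the local helper
    library_result = st.median(values)  # unambiguously calls the library

Because the library function is reached only through the alias, defining a
local ``median`` does not shadow it.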

For this example, we will use the winequality dataset.

.. code-block:: python

import verticapy.datasets as vpd

data = vpd.load_winequality()

.. raw:: html
:file: /project/data/VerticaPy/docs/figures/datasets_loaders_load_winequality.html

.. note::

VerticaPy offers a wide range of sample datasets that are ideal for training
and testing purposes. You can explore the full list of available datasets in
the :ref:`api.datasets`, which provides detailed information on each dataset and how to
PEP go back to line after dataset.

use them effectively. These datasets are invaluable resources for honing your
data analysis and machine learning skills within the VerticaPy environment.

You can easily divide your dataset into training and testing subsets using the
:py:meth:`vDataFrame.train_test_split` method. This is a crucial step when preparing
your data for machine learning, as it allows you to evaluate the performance of
your models accurately.

.. code-block:: python

data = vpd.load_winequality()
train, test = data.train_test_split(test_size = 0.2)

.. warning::

In this case, VerticaPy utilizes seeded randomization to guarantee the
reproducibility of your data split. However, please be aware that this
approach may lead to reduced performance. For a more efficient data
split, you can use the :py:meth:`vDataFrame.to_db` method to save your results
same

into ``tables`` or ``temporary tables``. This will help enhance the
overall performance of the process.
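
To make the idea of a seeded, reproducible split concrete, here is a
plain-Python sketch. It is not how VerticaPy performs the split (that happens
in the database); ``seeded_split`` and its parameters are hypothetical names
used only for illustration.

.. code-block:: python

    import random


    def seeded_split(rows, test_size=0.2, seed=42):
        """Split rows into (train, test) deterministically for a given seed."""
        rng = random.Random(seed)  # private generator; global state untouched
        shuffled = list(rows)      # copy so the input is not mutated
        rng.shuffle(shuffled)
        n_test = int(round(len(shuffled) * test_size))
        return shuffled[n_test:], shuffled[:n_test]


    rows = list(range(10))
    train_a, test_a = seeded_split(rows, test_size=0.2, seed=42)
    train_b, test_b = seeded_split(rows, test_size=0.2, seed=42)

Calling the function twice with the same seed yields identical subsets, which
is the property the warning refers to as reproducibility.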

.. ipython:: python
:suppress:

import verticapy as vp
import verticapy.datasets as vpd
data = vpd.load_winequality()
train, test = data.train_test_split(test_size = 0.2)

Model Initialization
^^^^^^^^^^^^^^^^^^^^^

First we import the ``LinearRegression`` model:

.. code-block:: python

from verticapy.machine_learning.vertica import LinearRegression

Then we can create the model:

.. code-block:: python

    model = LinearRegression(
        tol = 1e-6,
        max_iter = 100,
        solver = 'Newton',
        fit_intercept = True,
    )

.. hint::

In ``verticapy`` 1.0.x and higher, you do not need to specify the
model name, as the name is automatically assigned. If you need to
re-use the model, you can fetch the model name from the model's
attributes.

.. important::

The model name is crucial for the model management system and
versioning. It's highly recommended to provide a name if you
plan to reuse the model later.

.. ipython:: python
    :suppress:

    from verticapy.machine_learning.vertica import LinearRegression
    model = LinearRegression(
        tol = 1e-6,
        max_iter = 100,
        solver = 'Newton',
        fit_intercept = True,
    )

Model Training
^^^^^^^^^^^^^^^

We can now fit the model:

.. ipython:: python

    model.fit(
        train,
        [
            "fixed_acidity",
            "volatile_acidity",
            "citric_acid",
            "residual_sugar",
            "chlorides",
            "density",
        ],
        "quality",
        test,
    )

.. important::

To train a model, you can directly use the ``vDataFrame`` or the
name of the relation stored in the database. The test set is optional
and is only used to compute the test metrics. In ``verticapy``, we
don't work using ``X`` matrices and ``y`` vectors. Instead, we work
directly with lists of predictors and the response name.
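
For readers used to ``X`` and ``y`` arrays, the sketch below shows how a
name-based interface relates to them: the predictor names select the feature
columns and the response name selects the target. ``select_columns`` and the
sample rows are hypothetical and exist only for this illustration; VerticaPy
resolves the names against the relation in the database instead.

.. code-block:: python

    def select_columns(rows, predictors, response):
        """Build X (feature rows) and y (targets) from named columns."""
        X = [[row[name] for name in predictors] for row in rows]
        y = [row[response] for row in rows]
        return X, y


    # Illustrative values only, not taken from the real dataset.
    rows = [
        {"fixed_acidity": 7.4, "density": 0.9978, "quality": 5},
        {"fixed_acidity": 7.8, "density": 0.9968, "quality": 5},
        {"fixed_acidity": 11.2, "density": 0.9980, "quality": 6},
    ]
    X, y = select_columns(rows, ["fixed_acidity", "density"], "quality")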

Feature Importance
^^^^^^^^^^^^^^^^^^^

We can conveniently get the feature importance:

.. ipython:: python
:suppress:

vp.set_option("plotting_lib", "plotly")
fig = model.features_importance()
fig.write_html("/project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_feature.html")

.. code-block:: python

result = model.features_importance()

.. raw:: html
:file: /project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_feature.html

.. note::

For ``LinearModel``, feature importance is computed using the coefficients.
These coefficients are then normalized using the feature distribution. An
activation function is applied to get the final score.
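
The note does not spell out the exact normalization or activation function,
so the following is only a sketch under stated assumptions: each coefficient
is scaled by the standard deviation of its feature, and the scaled values are
normalized so that their magnitudes sum to 100. No activation function is
applied here. ``linear_importance`` and the sample numbers are hypothetical.

.. code-block:: python

    import statistics


    def linear_importance(coefficients, columns):
        """Return signed importance percentages for a linear model.

        Assumption: coefficient * feature standard deviation, normalized so
        that the absolute values sum to 100.
        """
        scaled = {
            name: coef * statistics.pstdev(columns[name])
            for name, coef in coefficients.items()
        }
        total = sum(abs(value) for value in scaled.values())
        if total == 0:
            return {name: 0.0 for name in scaled}
        return {name: 100 * value / total for name, value in scaled.items()}


    coefficients = {"a": 2.0, "b": -1.0}          # hypothetical coefficients
    columns = {"a": [0.0, 2.0], "b": [0.0, 6.0]}  # hypothetical feature values
    importance = linear_importance(coefficients, columns)

Scaling by the feature spread matters because a small coefficient on a widely
varying feature can influence predictions more than a large coefficient on a
nearly constant one.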

Metrics
^^^^^^^^

We can get the entire report using:

.. ipython:: python
    :suppress:

    result = model.report()
    with open("/project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_report.html", "w") as html_file:
        html_file.write(result._repr_html_())

.. code-block:: python

result = model.report()

.. raw:: html
:file: /project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_report.html

.. important::

Most metrics are computed using a single SQL query, but some of them might
require multiple SQL queries. Selecting only the necessary metrics in the
report can help optimize performance.
E.g. ``model.report(metrics = ["mse", "r2"])``.

For ``LinearModel``, we can easily get the ANOVA table using:

.. ipython:: python
    :suppress:

    result = model.report(metrics = "anova")
    with open("/project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_report_anova.html", "w") as html_file:
        html_file.write(result._repr_html_())

.. code-block:: python

result = model.report(metrics = "anova")

.. raw:: html
:file: /project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_report_anova.html

You can also use the ``LinearModel.score`` method to compute the R-squared
value:

.. ipython:: python

model.score()
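
For reference, R-squared is one minus the ratio of the residual sum of
squares to the total sum of squares. The plain-Python sketch below computes
it on made-up values; ``r_squared`` is a hypothetical helper, not part of
VerticaPy, which computes the metric in the database.

.. code-block:: python

    def r_squared(y_true, y_pred):
        """Compute R^2 = 1 - SS_res / SS_tot for paired observations."""
        mean_y = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean_y) ** 2 for t in y_true)
        return 1 - ss_res / ss_tot


    y_true = [5.0, 6.0, 7.0, 6.0]  # made-up observed values
    y_pred = [5.5, 6.0, 6.5, 6.0]  # made-up predictions
    score = r_squared(y_true, y_pred)  # 1 - 0.5 / 2.0 = 0.75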

Prediction
^^^^^^^^^^^

Prediction is straightforward:

.. ipython:: python
    :suppress:

    result = model.predict(
        test,
        [
            "fixed_acidity",
            "volatile_acidity",
            "citric_acid",
            "residual_sugar",
            "chlorides",
            "density",
        ],
        "prediction",
    )
    with open("/project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_prediction.html", "w") as html_file:
        html_file.write(result._repr_html_())

.. code-block:: python

    model.predict(
        test,
        [
            "fixed_acidity",
            "volatile_acidity",
            "citric_acid",
            "residual_sugar",
            "chlorides",
            "density",
        ],
        "prediction",
    )

.. raw:: html
:file: /project/data/VerticaPy/docs/figures/machine_learning_vertica_linear_model_lr_prediction.html

.. note::

Predictions can be made automatically using the test set, in which
case you don't need to specify the predictors. Alternatively, you
can pass only the ``vDataFrame`` to the
:py:meth:`verticapy.machine_learning.vertica.linear_model.LinearModel.predict`
function, but in this case, it's essential that the column names of
the ``vDataFrame`` match the predictors and response name in the
model.

Plots
^^^^^^

If the model allows, you can also generate relevant plots. For example,
regression plots can be found in the :ref:`chart_gallery.regression_plot`.

.. code-block:: python

model.plot()

.. important::

The plotting feature is typically suitable for models with fewer than
three predictors.

Parameter Modification
^^^^^^^^^^^^^^^^^^^^^^^

In order to see the parameters:

.. ipython:: python

model.get_params()

And to manually change some of the parameters:

.. ipython:: python

model.set_params({'tol': 0.001})

Model Register
^^^^^^^^^^^^^^

In order to register the model for tracking and versioning:

.. code-block:: python

model.register("model_v1")

Please refer to :ref:`notebooks/ml/model_tracking_versioning/index.html` for
more details on model tracking and versioning.

Model Exporting
^^^^^^^^^^^^^^^^

**To MemModel**

.. code-block:: python

model.to_memmodel()

.. note::

``MemModel`` objects serve as in-memory representations of machine
learning models. They can be used for both in-database and in-memory
prediction tasks. These objects can be pickled in the same way that
you would pickle a ``scikit-learn`` model.

The following methods for exporting the model use ``MemModel``, and it
is recommended to use ``MemModel`` directly.

**To SQL**

To get the SQL code, use:

.. ipython:: python

model.to_sql()
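
Conceptually, the SQL for a linear regression is the linear formula written
over column names. The sketch below renders such an expression from
hypothetical coefficients; ``linear_sql`` is an illustrative helper, and the
string actually returned by ``to_sql`` may be formatted differently.

.. code-block:: python

    def linear_sql(intercept, coefficients):
        """Render intercept + sum(coef * "column") as a SQL expression."""
        # Column names are assumed to contain no double quotes.
        terms = [f'{coef} * "{name}"' for name, coef in coefficients.items()]
        return " + ".join([str(intercept)] + terms)


    # Hypothetical intercept and coefficients.
    expression = linear_sql(1.5, {"fixed_acidity": 0.25, "density": -2.0})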

**To Python**

To obtain the prediction function in Python syntax, use the following code:

.. ipython:: python

X = [[4.2, 0.17, 0.36, 1.8, 0.029, 0.9899]]
model.to_python()(X)

.. hint::

The
:py:meth:`verticapy.machine_learning.vertica.linear_model.LinearModel.to_python`
method is used to retrieve predictions,
probabilities, or cluster distances. For specific details on how to
use this method for different model types, refer to the relevant
documentation for each model.
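
For a linear regression, a prediction function of this kind amounts to the
intercept plus the dot product of the coefficients with each input row. The
sketch below shows that shape with hypothetical values; ``make_predictor`` is
not a VerticaPy function, and the function returned by ``to_python`` may be
implemented differently.

.. code-block:: python

    def make_predictor(intercept, coefficients):
        """Return a function mapping rows of feature values to predictions."""

        def predict(X):
            return [
                intercept + sum(c * x for c, x in zip(coefficients, row))
                for row in X
            ]

        return predict


    # Hypothetical intercept and coefficients.
    predict = make_predictor(1.0, [0.5, -2.0])
    predictions = predict([[2.0, 0.25], [4.0, 1.0]])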
"""

# Properties.