Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix _ Isolation Forest Anomaly Plot #762

Merged
merged 4 commits into from
Oct 24, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion verticapy/machine_learning/vertica/cluster.py
Original file line number Diff line number Diff line change
Expand Up @@ -176,7 +176,7 @@ def plot(
**style_kwargs,
}
if self._model_subcategory == "ANOMALY_DETECTION":
fun = vdf.bubble
fun = vdf.scatter
name = "anomaly_score"
kwargs["cmap_col"] = name
else:
Expand Down
308 changes: 308 additions & 0 deletions verticapy/machine_learning/vertica/ensemble.py
Original file line number Diff line number Diff line change
Expand Up @@ -1087,6 +1087,314 @@ class IsolationForest(Clustering, Tree):
Float in the range (0,1] that specifies the
fraction of columns (features), chosen at random,
used when building each tree.

Examples
---------

The following examples provide a basic understanding of usage.
For more detailed examples, please refer to the
:ref:`user_guide.machine_learning` or the
`Examples <https://www.vertica.com/python/examples/>`_
section on the website.

Load data for machine learning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We import ``verticapy``:

.. ipython:: python

import verticapy as vp

.. hint::

By assigning an alias to ``verticapy``, we mitigate the risk of code
collisions with other libraries. This precaution is necessary
because verticapy uses commonly known function names like "average"
and "median", which can potentially lead to naming conflicts.
The use of an alias ensures that the functions from verticapy are
used as intended without interfering with functions from other
libraries.

For this example, we will use the winequality dataset.

.. code-block:: python

import verticapy.datasets as vpd

data = vpd.load_winequality()

.. raw:: html
:file: SPHINX_DIRECTORY/figures/datasets_loaders_load_winequality.html

.. note::

VerticaPy offers a wide range of sample datasets that are
ideal for training and testing purposes. You can explore
the full list of available datasets in the :ref:`api.datasets`,
which provides detailed information on each dataset
and how to use them effectively. These datasets are invaluable
resources for honing your data analysis and machine learning
skills within the VerticaPy environment.

.. ipython:: python
:suppress:

import verticapy.datasets as vpd
data = vpd.load_winequality()

Model Initialization
^^^^^^^^^^^^^^^^^^^^^

First we import the ``IsolationForest`` model:

.. code-block::

from verticapy.machine_learning.vertica import IsolationForest

.. ipython:: python
:suppress:

from verticapy.machine_learning.vertica import IsolationForest

Then we can create the model:

.. ipython:: python
:okwarning:

model = IsolationForest(
n_estimators = 10,
max_depth = 3,
nbins = 6,
)

.. hint::

In ``verticapy`` 1.0.x and higher, you do not need to specify the
model name, as the name is automatically assigned. If you need to
re-use the model, you can fetch the model name from the model's
attributes.

.. important::

The model name is crucial for the model management system and
versioning. It's highly recommended to provide a name if you
plan to reuse the model later.

Model Training
^^^^^^^^^^^^^^^

We can now fit the model:

.. ipython:: python
:okwarning:

model.fit(data, X = ["density", "sulphates"])

.. important::

To train a model, you can directly use the ``vDataFrame`` or the
name of the relation stored in the database. The test set is optional
and is only used to compute the test metrics. In ``verticapy``, we
don't work using ``X`` matrices and ``y`` vectors. Instead, we work
directly with lists of predictors and the response name.

.. hint::

For clustering and anomaly detection, the use of predictors is
optional. In such cases, all available predictors are considered,
which can include solely numerical variables or a combination of
numerical and categorical variables, depending on the model's
capabilities.

Prediction
^^^^^^^^^^^

Prediction is straight-forward:

.. ipython:: python
:suppress:

result = model.predict(data, ["density", "sulphates"])
html_file = open("figures/machine_learning_vertica_isolation_for_prediction.html", "w")
html_file.write(result._repr_html_())
html_file.close()

.. code-block:: python

model.predict(data, ["density", "sulphates"])

.. raw:: html
:file: SPHINX_DIRECTORY/figures/machine_learning_vertica_isolation_for_prediction.html

Plots - Anomaly Detection
^^^^^^^^^^^^^^^^^^^^^^^^^^

Plots highlighting the outliers can be easily drawn using:

.. code-block:: python

model.plot()

.. ipython:: python
:suppress:

vp.set_option("plotting_lib", "plotly")
fig = model.plot()
fig.write_html("figures/machine_learning_vertica_isolation_for_plot.html")

.. raw:: html
:file: SPHINX_DIRECTORY/figures/machine_learning_vertica_isolation_for_plot.html

.. note::

Most anomaly detection methods produce a score. In scenarios involving
2 or 3 predictors, using a bubble plot to visualize the model's results
is a straightforward approach. In such plots, the size of each bubble
corresponds to the anomaly score.

Plots - Tree
^^^^^^^^^^^^^

Tree models can be visualized by drawing their tree plots.
For more examples, check out :ref:`chart_gallery.tree`.

.. code-block:: python

model.plot_tree()

.. ipython:: python
:suppress:

res = model.plot_tree()
res.render(filename='figures/machine_learning_vertica_tree_isolation_for_', format='png')

.. image:: /../figures/machine_learning_vertica_tree_isolation_for_.png

.. note::

The above example may not render properly in the doc because
of the huge size of the tree. But it should render nicely
in jupyter environment.

In order to plot graph using `graphviz <https://graphviz.org/>`_
separately, you can extract the graphviz DOT file code as follows:

.. ipython:: python

model.to_graphviz()

This string can then be copied into a DOT file which can be
parsed by graphviz.

Plots - Contour
^^^^^^^^^^^^^^^^

In order to understand the parameter space, we can also look
at the contour plots:

.. code-block:: python

model.contour()

.. ipython:: python
:suppress:

fig = model.contour()
fig.write_html("figures/machine_learning_vertica_isolation_for_contour.html")

.. raw:: html
:file: SPHINX_DIRECTORY/figures/machine_learning_vertica_isolation_for_contour.html

.. note::

Machine learning models with two predictors can usually benefit
from their own contour plot. This visual representation aids in
exploring predictions and gaining a deeper understanding of how
these models perform in different scenarios. Please refer to
:ref:`chart_gallery.contour_plot` for more examples.

Parameter Modification
^^^^^^^^^^^^^^^^^^^^^^^

In order to see the parameters:

.. ipython:: python

model.get_params()

And to manually change some of the parameters:

.. ipython:: python

model.set_params({'max_depth': 5})

Model Register
^^^^^^^^^^^^^^

In order to register the model for tracking and versioning:

.. code-block:: python

model.register("model_v1")

Please refer to :ref:`notebooks/ml/model_tracking_versioning/index.html`
for more details on model tracking and versioning.

Model Exporting
^^^^^^^^^^^^^^^^

**To Memmodel**

.. code-block:: python

model.to_memmodel()

.. note::

``MemModel`` objects serve as in-memory representations of machine
learning models. They can be used for both in-database and in-memory
prediction tasks. These objects can be pickled in the same way that
you would pickle a ``scikit-learn`` model.

The preceding methods for exporting the model use ``MemModel``, and it
is recommended to use ``MemModel`` directly.

**To SQL**

You can get the SQL query equivalent of the XGB model by:

.. ipython:: python

model.to_sql()

.. note:: This SQL query can be directly used in any database.

**Deploy SQL**

To get the SQL query which uses Vertica functions use below:

.. ipython:: python

model.deploySQL()

**To Python**

To obtain the prediction function in Python syntax, use the following code:

.. ipython:: python

X = [[0.9, 0.5]]
model.to_python()(X)

.. hint::

The
:py:mod:`verticapy.machine_learning.vertica.tree.IsolationForest.to_python`
method is used to retrieve the anomaly score.
For specific details on how to
use this method for different model types, refer to the relevant
documentation for each model.
"""

# Properties.
Expand Down
1 change: 1 addition & 0 deletions verticapy/machine_learning/vertica/linear_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -324,6 +324,7 @@ class ElasticNet(Regressor, LinearModel):
in training the model. Note that setting
fit_intercept to false does not work well with
the BFGS optimizer.

Examples
---------

Expand Down
Loading