Commit

Sphinx Docstring Update - machine_learning/vertica/neighbors + DBSCAN in cluster + MCA in decomposition (#786)

* Update neighbors.py

* DBSCAN docstring

* Docstring for MCA

* Getting Latest changes from Master (#789)

* Sphinx Docstring Update - Nearest Centroid (#782)

* Update cluster.py

* correcting bugs and doc - Nearest Centroids

* Update cluster.py

* Update cluster.py

---------

Co-authored-by: Badr <[email protected]>

* Correcting format_relation (#785)

* Update naive_bayes.py (#784)

---------

Co-authored-by: Badr <[email protected]>
Co-authored-by: Badr Ouali <[email protected]>

* bug correction

* Update neighbors.py

* corrections

* black

* Update neighbors.py

---------

Co-authored-by: Badr <[email protected]>
Co-authored-by: Badr Ouali <[email protected]>
3 people authored Oct 31, 2023
1 parent cbc8210 commit 92b5fc2
Showing 4 changed files with 1,116 additions and 38 deletions.
5 changes: 4 additions & 1 deletion verticapy/machine_learning/metrics/classification.py
@@ -226,7 +226,10 @@ def _compute_final_score(
and average != "binary"
):
raise ValueError(
"Parameter 'pos_label' can only be used when parameter 'average' is set to 'binary' or undefined."
"The 'pos_label' parameter can only be used when the 'average' "
"parameter is set to 'binary' or left undefined. This error can "
"also occur when you are using a binary classifier; in that case, "
"the 'average' parameter can only be set to 'binary' or left undefined."
)
if not (isinstance(pos_label, NoneType)) and not (isinstance(labels, NoneType)):
labels = None
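The intent of this guard can be sketched as a standalone check. This is a hypothetical simplification for illustration only; the real `_compute_final_score` validates additional conditions:

```python
# Hypothetical, simplified version of the validation guard shown in the
# diff above: an explicit 'pos_label' only makes sense when averaging is
# 'binary' (or left undefined), so any other combination fails fast.
def check_pos_label(pos_label, average):
    if pos_label is not None and average is not None and average != "binary":
        raise ValueError(
            "The 'pos_label' parameter can only be used when the 'average' "
            "parameter is set to 'binary' or left undefined."
        )
```

For example, `check_pos_label(1, "micro")` raises a `ValueError`, while `check_pos_label(1, "binary")` and `check_pos_label(None, "weighted")` pass silently.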
194 changes: 179 additions & 15 deletions verticapy/machine_learning/vertica/cluster.py
@@ -1793,21 +1793,32 @@ class DBSCAN(VerticaModel):
compute the distances and neighbors, and uses Python to
compute the cluster propagation (non-scalable phase).
.. warning::

    This algorithm uses a CROSS JOIN during computation, and is therefore
    computationally expensive at O(n * n), where n is the total number of
    elements. The algorithm indexes the elements of the table in order to
    be optimal (the CROSS JOIN only happens between integer IDs). Since
    DBSCAN uses the p-distance, it is highly sensitive to unnormalized
    data. However, DBSCAN is robust to outliers and can find non-linear
    clusters. It is a very powerful algorithm for outlier detection and
    clustering. A table is created at the end of the learning phase.

.. important::

    This algorithm is not Vertica Native and relies solely on SQL for
    attribute computation. While this model does not take advantage of
    the benefits provided by a model management system, including
    versioning and tracking, the SQL code it generates can still be used
    to create a pipeline.
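The quadratic cost that the warning describes can be illustrated with a plain-Python brute-force neighbor search. This is a conceptual sketch only; VerticaPy generates SQL for this phase:

```python
def neighbors_within_eps(points, eps, p=2):
    """Brute-force neighbor search: every row is compared with every
    other row (the O(n * n) CROSS JOIN the warning refers to), using
    the Minkowski p-distance."""
    n = len(points)
    result = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # p-distance between two equal-length coordinate tuples.
            dist = sum(abs(a - b) ** p for a, b in zip(points[i], points[j])) ** (1 / p)
            if dist <= eps:
                result[i].append(j)
    return result
```

On three 1-D points, `neighbors_within_eps([(0,), (0.3,), (10,)], 0.5)` pairs only the first two. The comparison count still grows as n * n regardless of how many neighbors are found, which is why the integer-ID indexing mentioned above matters.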
Parameters
----------
@@ -1828,6 +1839,159 @@
p: int, optional
The p of the p-distance (distance metric used
during the model computation).
Examples
---------
The following examples provide a basic understanding of usage.
For more detailed examples, please refer to the
:ref:`user_guide.machine_learning` or the
`Examples <https://www.vertica.com/python/examples/>`_
section on the website.
Load data for machine learning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We import ``verticapy``:
.. ipython:: python

    import verticapy as vp

.. hint::

    By assigning an alias to ``verticapy``, we mitigate the risk of code
    collisions with other libraries. This precaution is necessary because
    verticapy uses commonly known function names like "average" and
    "median", which can potentially lead to naming conflicts. The use of
    an alias ensures that the functions from verticapy are used as
    intended, without interfering with functions from other libraries.
For this example, we will create a small dataset.
.. ipython:: python

    data = vp.vDataFrame({"col": [1.2, 1.1, 1.3, 1.5, 2, 2.2, 1.09, 0.9, 100, 102]})

.. note::

    VerticaPy offers a wide range of sample datasets that are ideal for
    training and testing purposes. You can explore the full list of
    available datasets in the :ref:`api.datasets`, which provides detailed
    information on each dataset and how to use them effectively. These
    datasets are invaluable resources for honing your data analysis and
    machine learning skills within the VerticaPy environment.
Model Initialization
^^^^^^^^^^^^^^^^^^^^^
First we import the ``DBSCAN`` model:
.. code-block:: python

    from verticapy.machine_learning.vertica import DBSCAN

.. ipython:: python
    :suppress:

    from verticapy.machine_learning.vertica import DBSCAN
Then we can create the model:
.. ipython:: python
    :okwarning:

    model = DBSCAN(
        eps = 0.5,
        min_samples = 2,
        p = 2,
    )

.. important::

    As this model is not native, it solely relies on SQL statements to
    compute various attributes, storing them within the object. No data
    is saved in the database.
Model Training
^^^^^^^^^^^^^^^
We can now fit the model:
.. ipython:: python
    :okwarning:

    model.fit(data, X = ["col"])

.. important::

    To train a model, you can directly use the ``vDataFrame`` or the name
    of the relation stored in the database.

.. hint::

    For clustering and anomaly detection, the use of predictors is
    optional. In such cases, all available predictors are considered,
    which can include solely numerical variables or a combination of
    numerical and categorical variables, depending on the model's
    capabilities.
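Conceptually, the clustering that ``fit`` computes on this toy column can be sketched in pure Python. This illustrates the DBSCAN algorithm itself, not VerticaPy's SQL-based implementation:

```python
def toy_dbscan(points, eps, min_samples, p=2):
    """Toy DBSCAN for illustration: returns one label per point,
    with -1 marking noise and clusters numbered from 0."""
    n = len(points)

    def dist(a, b):
        # Minkowski p-distance; reduces to |a - b| for 1-D points.
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    # O(n * n) neighborhood phase; a point counts as its own neighbor,
    # so min_samples includes the point itself.
    neigh = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_samples:
            labels[i] = -1  # noise (may later become a border point)
            continue
        # i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        stack = list(neigh[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point absorbed by the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neigh[j]) >= min_samples:
                stack.extend(neigh[j])  # j is core too: keep expanding
    return labels
```

On the ten values loaded above (with ``eps = 0.5`` and ``min_samples = 2``), the eight values between 0.9 and 2.2 chain into a single cluster, while 100 and 102 are flagged as noise — which is exactly why DBSCAN is a good fit for this outlier-laden column.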
Prediction
^^^^^^^^^^^
Predicting or ranking the dataset is straightforward:

.. ipython:: python
    :suppress:

    result = model.predict()
    html_file = open("figures/machine_learning_vertica_dbscan_prediction.html", "w")
    html_file.write(result._repr_html_())
    html_file.close()

.. code-block:: python

    model.predict()

.. raw:: html
    :file: SPHINX_DIRECTORY/figures/machine_learning_vertica_dbscan_prediction.html
As shown above, a new column has been created, containing
the clusters.
.. hint::

    The name of the new column is optional. If not provided, it is
    randomly assigned.
Parameter Modification
^^^^^^^^^^^^^^^^^^^^^^^
In order to see the parameters:

.. ipython:: python

    model.get_params()

And to manually change some of the parameters:

.. ipython:: python

    model.set_params({'min_samples': 5})
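The get/set pattern shown here can be sketched with a minimal parameter container. This is an illustrative stand-in, not VerticaPy's actual base class:

```python
class ParamSketch:
    """Minimal sketch of the get_params / set_params pattern: parameters
    live in a dict, are returned as a copy, and are updated by merging,
    so parameters not mentioned keep their current values."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self):
        return dict(self._params)  # copy, so callers cannot mutate state

    def set_params(self, new_params):
        self._params.update(new_params)  # partial update via dict merge
```

Calling `set_params({'min_samples': 5})` on `ParamSketch(eps=0.5, min_samples=2, p=2)` leaves `eps` and `p` untouched, mirroring the partial update performed above.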
Model Register
^^^^^^^^^^^^^^
As this model is not native, it does not support model management and
versioning. However, it is possible to use the SQL code it generates
for deployment.
"""

# Properties.
