added tutorial example
exs-whaddadin committed Nov 24, 2023
1 parent f421c9f commit e7f4034
Showing 4 changed files with 192 additions and 0 deletions.
13 changes: 13 additions & 0 deletions docs/source/pages/standard_api/intro.md
@@ -17,6 +17,19 @@ API of the package. Learning to use the following functionality gives the user i
of the package. Each of the following methods can be imported from the relevant submodule (for example
``from molflux.datasets import load_from_dict``).

## Browsing

We start with the basic browsing functionality of the submodules. Each submodule has a ``list_*`` function
that returns a dictionary of the available objects (datasets, representations, models, etc.). These functions are:

1) ``list_datasets``
2) ``list_representations``
3) ``list_splits``
4) ``list_models``
5) ``list_metrics``

Each returned dictionary is grouped by optional dependency: the key is the dependency that the objects require, and
the value is the list of objects available under it.
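
As a rough illustration of that shape, the sketch below works on a hand-written stand-in dictionary rather than on a real ``list_datasets()`` result. The keys and object names in ``example_catalogue`` and the helper ``find_dependency`` are hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the dictionary a ``list_*`` function returns:
# optional dependency (key) -> names of the objects it provides (value).
example_catalogue: Dict[str, List[str]] = {
    "core": ["dataset_a", "dataset_b"],
    "extra_x": ["dataset_c"],
}


def find_dependency(catalogue: Dict[str, List[str]], name: str) -> Optional[str]:
    """Return the dependency group that provides ``name``, or None if absent."""
    for dependency, names in catalogue.items():
        if name in names:
            return dependency
    return None


# Flatten the grouped catalogue into a single sorted list of object names.
all_names = sorted(n for names in example_catalogue.values() for n in names)
print(all_names)
print(find_dependency(example_catalogue, "dataset_c"))
```

With ``molflux`` installed, the same helper should work on the output of any of the ``list_*`` functions above, assuming it has the key/value layout described here.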

## Loading

175 changes: 175 additions & 0 deletions docs/source/pages/tutorials/esol_training.md
@@ -0,0 +1,175 @@
---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# ESOL Training

In this tutorial we provide a simple example of training a random forest model on the ESOL dataset. We require the ``rdkit``
package, so make sure to ``pip install 'molflux[rdkit]'`` to follow along!


## Loading the ESOL dataset

First, let's load the ESOL dataset:

```{code-cell} ipython3
from molflux.datasets import load_dataset
dataset = load_dataset("esol")
print(dataset)
print(dataset[0])
```

You can see that there are two columns: ``smiles`` and ``log_solubility``.


## Featurising

Now, we will featurise the dataset. For this, we will use the Morgan and MACCS fingerprints from ``rdkit`` and the
``featurise_dataset`` function from ``molflux.datasets``.

```{code-cell} ipython3
from molflux.datasets import load_dataset, featurise_dataset
from molflux.features import load_from_dicts as load_representations_from_dicts

dataset = load_dataset("esol")

# Load the Morgan and MACCS fingerprint representations.
featuriser = load_representations_from_dicts(
    [
        {"name": "morgan"},
        {"name": "maccs_rdkit"},
    ]
)

# Compute both fingerprints from the SMILES column.
featurised_dataset = featurise_dataset(dataset, column="smiles", representations=featuriser)
print(featurised_dataset)
```

You can see that we now have two extra columns, one for each fingerprint we used.
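
The new columns appear to follow a ``<source column>::<representation name>`` pattern, judging by the feature names used in the training step of this tutorial (``smiles::morgan`` and ``smiles::maccs_rdkit``). A minimal sketch of that pattern, where ``featurised_column_name`` is a hypothetical helper and not part of ``molflux``:

```python
def featurised_column_name(source_column: str, representation: str) -> str:
    """Build a column name of the form '<source column>::<representation>'."""
    return f"{source_column}::{representation}"


# The two representation names used in this tutorial.
representation_names = ["morgan", "maccs_rdkit"]

# Column names we would expect after featurising the "smiles" column.
x_features = [featurised_column_name("smiles", name) for name in representation_names]
print(x_features)
```

Building the names this way keeps a model's feature list in step with the list of representations, but printing the featurised dataset remains the reliable way to check the real column names.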

## Splitting

Next, we need to split the dataset. For this, we use the simple ``shuffle_split`` (random split) with 80% of the data
for training and 20% for testing. To split the dataset, we use the ``split_dataset`` function from ``molflux.datasets``.
It returns an iterator over splits, so we take the first one with ``next``.

```{code-cell} ipython3
from molflux.datasets import load_dataset, featurise_dataset, split_dataset
from molflux.features import load_from_dicts as load_representations_from_dicts
from molflux.splits import load_from_dict as load_split_from_dict

dataset = load_dataset("esol")

# Featurise the SMILES column as in the previous step.
featuriser = load_representations_from_dicts(
    [
        {"name": "morgan"},
        {"name": "maccs_rdkit"},
    ]
)
featurised_dataset = featurise_dataset(dataset, column="smiles", representations=featuriser)

# Random split: 80% train, no validation set, 20% test.
shuffle_strategy = load_split_from_dict(
    {
        "name": "shuffle_split",
        "presets": {
            "train_fraction": 0.8,
            "validation_fraction": 0.0,
            "test_fraction": 0.2,
        },
    }
)

# Take the first split produced by the strategy.
split_featurised_dataset = next(split_dataset(featurised_dataset, shuffle_strategy))
print(split_featurised_dataset)
```
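
To make the idea of a random split concrete, here is a plain-Python sketch of shuffling row indices and cutting them by fraction. It is an illustration of the general technique only, not the ``molflux`` implementation; the function ``shuffle_split_indices``, its ``seed`` argument, and the ``"validation"`` key are assumptions made for this example.

```python
import random
from typing import Dict, List


def shuffle_split_indices(
    n_rows: int,
    train_fraction: float,
    validation_fraction: float,
    test_fraction: float,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Shuffle row indices and cut them into train/validation/test parts."""
    total = train_fraction + validation_fraction + test_fraction
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)

    n_train = int(round(train_fraction * n_rows))
    n_validation = int(round(validation_fraction * n_rows))
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_validation],
        # The test part takes whatever remains after the first two cuts.
        "test": indices[n_train + n_validation:],
    }


folds = shuffle_split_indices(10, 0.8, 0.0, 0.2)
print({name: len(rows) for name, rows in folds.items()})
```

Every row lands in exactly one part, which is the property that keeps test rows out of training.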


## Training the model

We can now turn to training the model! We choose the ``random_forest_regressor`` (which we access from the ``sklearn`` package).
To do so, we need to define the model config, which includes the ``x_features`` (input columns) and the ``y_features`` (target columns).

Once the model is trained, we generate predictions on the test split, compute regression metrics, and plot the predicted values against the true values.

```{code-cell} ipython3
import json

from molflux.datasets import load_dataset, featurise_dataset, split_dataset
from molflux.features import load_from_dicts as load_representations_from_dicts
from molflux.splits import load_from_dict as load_split_from_dict
from molflux.modelzoo import load_from_dict as load_model_from_dict
from molflux.metrics import load_suite
import matplotlib.pyplot as plt

dataset = load_dataset("esol")

# Featurise the SMILES column as in the previous steps.
featuriser = load_representations_from_dicts(
    [
        {"name": "morgan"},
        {"name": "maccs_rdkit"},
    ]
)
featurised_dataset = featurise_dataset(dataset, column="smiles", representations=featuriser)

# Random split: 80% train, no validation set, 20% test.
shuffle_strategy = load_split_from_dict(
    {
        "name": "shuffle_split",
        "presets": {
            "train_fraction": 0.8,
            "validation_fraction": 0.0,
            "test_fraction": 0.2,
        },
    }
)
split_featurised_dataset = next(split_dataset(featurised_dataset, shuffle_strategy))

# Define the model: both fingerprints as inputs, log solubility as the target.
model = load_model_from_dict(
    {
        "name": "random_forest_regressor",
        "config": {
            "x_features": ["smiles::morgan", "smiles::maccs_rdkit"],
            "y_features": ["log_solubility"],
        },
    }
)

# Train on the training split and predict on the test split.
model.train(split_featurised_dataset["train"])
preds = model.predict(split_featurised_dataset["test"])

# Score the predictions with the regression metrics suite.
regression_suite = load_suite("regression")
scores = regression_suite.compute(
    references=split_featurised_dataset["test"]["log_solubility"],
    predictions=preds["random_forest_regressor::log_solubility"],
)
print(json.dumps(scores, indent=4))

# Plot predicted against true values.
plt.scatter(
    split_featurised_dataset["test"]["log_solubility"],
    preds["random_forest_regressor::log_solubility"],
)
plt.xlabel("True values")
plt.ylabel("Predicted values")
plt.show()
```
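
To help interpret the printed scores, here is a plain-Python sketch of three common regression metrics (RMSE, MAE and R²). It is an illustration only: the exact set of metrics in the ``regression`` suite is not listed in this tutorial, the helper ``regression_scores`` is hypothetical, and the numbers below are made-up toy values rather than ESOL results.

```python
import math
from typing import Dict, Sequence


def regression_scores(
    references: Sequence[float], predictions: Sequence[float]
) -> Dict[str, float]:
    """Compute RMSE, MAE and R² for paired true and predicted values."""
    if len(references) != len(predictions) or len(references) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(references)
    errors = [p - r for r, p in zip(references, predictions)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_ref = sum(references) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in references)  # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when all reference values are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Toy values, for illustration only.
toy_references = [-1.0, -2.0, -3.0, -4.0]
toy_predictions = [-1.5, -2.0, -2.5, -4.0]
toy_scores = regression_scores(toy_references, toy_predictions)
print(toy_scores)
```

Lower RMSE and MAE mean smaller errors in the units of ``log_solubility``, while an R² closer to 1 means the predictions explain more of the variance in the true values.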
2 changes: 2 additions & 0 deletions docs/source/pages/tutorials/index.md
@@ -1,3 +1,5 @@
# Tutorials

Here, we provide a number of tutorials for different use cases to showcase the functionality of ``molflux``.

[ESOL Training](esol_training.md)
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -104,6 +104,8 @@ docs = [
'sphinx_design',
'jupytext>=1.11.2',
'myst-nb',
'rdkit>=2023.9.1',
'matplotlib'
]
tests = [
'coverage[toml]',
