- 🔑 Easy-to-extend pluggable architecture.
- 🌀 Several state-of-the-art hardness characterisation methods.
- 📚 Read the docs!
- ✈️ Check out the tutorials!
Please note: datagnosis does not handle missing data, so missing values must be imputed first. HyperImpute can be used to do this.
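As a rough illustration of the imputation step, the sketch below fills missing values with per-column medians using pandas. This is a simple stand-in and not HyperImpute's API; the helper name impute_with_median and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_with_median(X: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with each column's median (hypothetical helper)."""
    return X.fillna(X.median(numeric_only=True))


# Toy feature frame with missing entries (illustrative values only)
X_raw = pd.DataFrame(
    {
        "sepal_length": [5.1, np.nan, 6.3],
        "sepal_width": [3.5, 3.0, np.nan],
    }
)

X_clean = impute_with_median(X_raw)

# Verify that nothing is missing before handing the data to datagnosis
has_missing = bool(X_clean.isna().any().any())
```

A check like has_missing is cheap to run before building a DataHandler, whichever imputation method is used.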
The library can be installed from PyPI using
$ pip install datagnosis
or from source, using
$ pip install .
Other library extensions:
- Install the library with unit-testing support
pip install datagnosis[testing]
# Load the iris dataset from sklearn and create a DataHandler object
from sklearn.datasets import load_iris
from datagnosis.plugins.core.datahandler import DataHandler
X, y = load_iris(return_X_y=True, as_frame=True)
datahandler = DataHandler(X, y, batch_size=32)
# Create the model and its training parameters
from datagnosis.plugins.core.models.simple_mlp import SimpleMLP
import torch
model = SimpleMLP()
# Create the optimizer and loss function objects
learning_rate = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Get a plugin and fit it
from datagnosis.plugins import Plugins  # plugin loader used below
hcm = Plugins().get(
    "vog",
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr=learning_rate,
    epochs=10,
    num_classes=3,
    logging_interval=1,
)
hcm.fit(
    datahandler=datahandler,
    use_caches_if_exist=True,
)
# Plot the resulting scores
hcm.plot_scores(axis=1, plot_type="scatter")
Datagnosis builds on D-CAT, a Hardness Characterization Method benchmarking framework also from the van der Schaar lab.
For benchmarking of the methods below, see https://github.com/seedatnabeel/D-CAT.
Method | Type | Description | Score | Reference |
---|---|---|---|---|
Area Under the Margin (AUM) | Generic | Characterizes data examples based on the margin of a classifier – i.e. the difference between the logit values of the correct class and the next class. | Hard - low scores. | AUM Paper |
Confident Learning | Generic | Confident learning estimates the joint distribution of noisy and true labels — characterizing data as easy and hard for mislabeling. | Hard - low scores | Confident Learning Paper |
Conf Agree | Generic | Agreement measures the agreement of predictions on the same example. | Hard - low scores | Conf Agree Paper |
Data IQ | Generic | Data-IQ computes the aleatoric uncertainty and confidence to characterize the data into easy, ambiguous and hard examples. | Hard - low confidence scores. High Aleatoric Uncertainty scores define ambiguous | Data-IQ Paper |
Data Maps | Generic | Data Maps focuses on measuring variability (epistemic uncertainty) and confidence to characterize the data into easy, ambiguous and hard examples. | Hard - low confidence scores. High Epistemic Uncertainty scores define ambiguous | Data-Maps Paper |
Gradient Normed (GraNd) | Generic | GraNd measures the gradient norm to characterize data. | Hard - high scores | GraNd Paper |
Error L2-Norm (EL2N) | Generic | EL2N calculates the L2 norm of error over training in order to characterize data for computational purposes. | Hard - high scores | EL2N Paper |
Forgetting | Generic | Forgetting scores analyze example transitions through training, i.e., when a sample that was correctly learned at one epoch is then forgotten. | Hard - high scores | Forgetting Paper |
Large Loss | Generic | Large Loss characterizes data based on sample-level loss magnitudes. | Hard - high scores | Large Loss Paper |
Prototypicality | Generic | Prototypicality calculates the latent space clustering distance of the sample to the class centroid as the metric to characterize data. | Hard - high scores | Prototypicality Paper |
Variance of Gradients (VOG) | Generic | VOG (Variance of Gradients) estimates the variance of gradients for each sample over training. | Hard - high scores | VOG Paper |
Active Learning Guided by Local Sensitivity and Hardness (ALLSH) | Images | ALLSH computes the KL divergence of softmax outputs between original and augmented samples to characterize data. | Hard - high scores | ALLSH Paper |
Generic type plugins can be used for tabular or image data. Image type plugins only work for images.
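To make two of the score definitions in the table concrete, here is a minimal NumPy sketch of the AUM margin and of forgetting-event counting, assuming per-epoch logits have already been recorded. It is not the datagnosis implementation; the function names, the array layout, and the toy numbers are assumptions for illustration.

```python
import numpy as np


def area_under_margin(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Average over epochs of (assigned-class logit - largest other logit).

    logits has shape (epochs, n_samples, n_classes); labels has shape (n_samples,).
    """
    logits = np.asarray(logits, dtype=float)
    idx = np.arange(logits.shape[1])
    assigned = logits[:, idx, labels]      # (epochs, n_samples)
    masked = logits.copy()
    masked[:, idx, labels] = -np.inf       # hide the assigned class
    largest_other = masked.max(axis=2)     # (epochs, n_samples)
    return (assigned - largest_other).mean(axis=0)


def forgetting_counts(correct: np.ndarray) -> np.ndarray:
    """Count correct -> incorrect transitions between consecutive epochs.

    correct has shape (epochs, n_samples) and holds booleans.
    """
    correct = np.asarray(correct, dtype=bool)
    forgotten = correct[:-1] & ~correct[1:]
    return forgotten.sum(axis=0)


# Toy logits for 3 epochs, 2 samples, 3 classes (illustrative values only)
logits = np.array(
    [
        [[2.0, 0.5, 0.1], [0.2, 1.0, 0.9]],
        [[3.0, 0.2, 0.1], [0.1, 0.8, 1.2]],
        [[4.0, 0.1, 0.0], [0.3, 1.5, 1.0]],
    ]
)
labels = np.array([0, 2])

aum = area_under_margin(logits, labels)
forgets = forgetting_counts(logits.argmax(axis=2) == labels)
```

In this toy data the second sample has a negative average margin and one forgetting event, so both scores flag it as the harder example: a low AUM and a high forgetting count, matching the Score column.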
Install the testing dependencies using
pip install .[testing]
The tests can be executed using
pytest -vvvsx tests/ --durations=50
We want to make contributing to datagnosis as easy and transparent as possible. We hope to collaborate with as many people as we can.
First create a new environment. It is recommended that you use conda. This can be done as follows:
conda create -n your-datagnosis-env python=3.11
conda activate your-datagnosis-env
Python versions 3.8, 3.9, 3.10, and 3.11 are all compatible, but it is best to use the most up-to-date version you can, as some models may not support older Python versions.
To get the development installation with all the dependencies needed for linting, testing, auto-formatting, pre-commit, etc., run the following:
git clone https://github.com/vanderschaarlab/datagnosis.git
cd datagnosis
pip install -e .[testing]
Please check that pre-commit is properly installed for the repository by running:
pre-commit run --all-files
This checks that you are set up properly to contribute, such that you will match the code style in the rest of the project. This is covered in more detail below.
We believe that having a consistent code style is incredibly important. Therefore, datagnosis imposes certain rules on contributed code, and the automated tests will not pass if the style is not adhered to. These tests must pass before a contribution can be merged. However, we make adhering to this code style as simple as possible. First, all the libraries required to produce code that is compatible with datagnosis's code style are installed in the step above, when you set up the development environment. Second, these libraries are all triggered by pre-commit, so once you are set up, you don't need to do anything. When you run git commit, any simple changes needed to enforce the style are applied automatically, and other required changes are explained in stdout for you to go through and fix.
datagnosis uses the black code formatter and the flake8 linter to enforce a common code style across the code base. No additional configuration should be needed (see the black documentation for advanced usage).
datagnosis also uses isort to sort imports alphabetically and separate them into sections.
datagnosis is fully typed using Python 3.7+ type hints. This is enforced for contributions by mypy, a static type checker.
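As a rough guide to what fully typed code looks like, the sketch below annotates every parameter and the return value so that mypy can check it. The helper top_k_hardest is hypothetical and is not part of datagnosis.

```python
from typing import List, Sequence


def top_k_hardest(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, highest first (hypothetical helper)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


hardest = top_k_hardest([0.1, 0.9, 0.4], k=2)
```

The typing-module generics (List, Sequence) are used here rather than built-in generics such as list[int], since the latter require Python 3.9 or later.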
We actively welcome your pull requests.
- Fork the repo and create your branch from main.
- If you have added code that should be tested, add tests in the same style as those already present in the repo.
- If you have changed APIs, document the API change in the PR.
- Ensure the test suite passes.
- Make sure your code passes the pre-commit checks. If you have properly installed pre-commit, which is included in the testing extra, this is required in order to commit and push.
We use GitHub issues to track public bugs. Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue.
By contributing to datagnosis, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree. You should therefore make sure that any dependencies you have introduced are also covered by a license that allows the code to be used by the project and is compatible with the license in the root directory of this project.