This is the code associated with our paper where we analyzed various dataset documentation approaches that can help with the responsible development of AI models. See this inventory for all related resources, including the paper.
The overall code is structured according to the FAIR-BioRS guidelines. The Python code in the various Jupyter notebooks follows the PEP8 guidelines. All the dependencies are documented in the environment.yml file.
We recommend using Anaconda to create and manage your development environment and using JupyterLab to run the notebook. All the subsequent instructions are provided assuming you are using Anaconda (Python 3 version) and JupyterLab.
Clone the repo or download as a zip and extract.
Open Anaconda Prompt (Windows) or the system command-line interface, then navigate to the code folder:
$ cd ./dataset-documentation-paper-code
$ conda env create -f environment.yml
$ conda activate dataset-documentation-env
$ conda install ipykernel
$ ipython kernel install --user --name=dataset-documentation
$ conda deactivate
The required environment variables are listed in the table below, along with information on how to obtain them:
| Suggested name | Value or instructions for obtaining it | Purpose |
|---|---|---|
| GITHUB_ACCESS_TOKEN | https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api | Required to run the GitHub search code in real-world-usage.ipynb |
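As an illustration of how a notebook might pick up this variable, here is a minimal sketch that reads the token from the environment and builds request headers for the GitHub REST API. The function name `github_headers` is hypothetical and is not taken from real-world-usage.ipynb; the notebook may load and use the token differently.

```python
import os


def github_headers(env_var: str = "GITHUB_ACCESS_TOKEN") -> dict:
    """Build GitHub REST API headers from a token stored in the environment.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending unauthenticated requests.
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the GitHub search code."
        )
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Reading the token from the environment keeps it out of the notebook itself, so it is not committed to the repository by accident.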
Launch JupyterLab and navigate to the Jupyter notebook of interest. Make sure to change the kernel to the one created above, called "dataset-documentation" (e.g., see here). If you are editing the notebook, we recommend using the JupyterLab code formatter along with the Black and isort formatters to facilitate compliance with PEP8.
The Jupyter notebook makes use of files in the dataset associated with the paper (see here). You will need to download the dataset and add it to the inputs folder (name the dataset folder 'dataset' after downloading it).
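Before running a notebook, it can help to confirm that the dataset folder is where the code expects it. The sketch below assumes the layout described above (an inputs folder containing a folder named 'dataset') and that paths are resolved relative to the repository root; the helper name `find_dataset_dir` is hypothetical.

```python
from pathlib import Path


def find_dataset_dir(repo_root: str = ".") -> Path:
    """Return the path to inputs/dataset, or raise if it has not been added yet.

    Hypothetical helper for illustration; not part of the repository code.
    """
    dataset_dir = Path(repo_root) / "inputs" / "dataset"
    if not dataset_dir.is_dir():
        raise FileNotFoundError(
            f"Expected the dataset folder at {dataset_dir}. "
            "Download the dataset and place it in the inputs folder as 'dataset'."
        )
    return dataset_dir
```

Calling `find_dataset_dir()` from the repository root returns the folder path if the dataset is in place and raises a descriptive error otherwise.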
The code produces plots and tables that are displayed in the notebook and also saved as files. The saved plot files are included in the outputs folder.
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub issues to submit feedback or make suggestions. You can also fork the repository and submit a pull request with your suggestions.
If you use this code, please cite the related paper (it will be listed here when available) and also cite this repository as:
Simpkins, Kyongmi, Patel, Bhavesh. Code: Dataset Documentation for AI Paper [Software]. Zenodo. https://doi.org/10.5281/zenodo.14583673