This repository serves as a personal template for data science projects.
- Analysis scripts and notebooks are located in `analysis/`.
- Reusable functions and modules are stored in the local package `src/`.
- The package can then be installed in development mode with `pip install -e .` for easy prototyping. `src/config.py` is used to store variables, constants and configurations.
- The package version is extracted from git tags using setuptools_scm, following semantic versioning.
- Tests for functions in `src/` should go to `tests/` and follow the convention `test_*.py`.
Moreover, I use the following directories, which are (usually) ignored by Git:
- `data/` to store data files.
- `docs/` to store API documentation generated with pdoc by running `scripts/build_docs.sh`.
- `results/` to store results/output files such as figures, output data, etc.
I can set up the environment differently depending on the project. The irrelevant sections can be deleted when using the template.
The following does not apply when managing requirements with conda; see the section below.
The requirements are specified in the following files:
- `requirements.in` to specify direct dependencies.
- `requirements.txt` to pin the dependencies (direct and indirect). This is the file used to recreate the environment from scratch using `pip install -r requirements.txt`.
- `pyproject.toml` to store the direct dependencies of the `src` package.
The `requirements.txt` file should not be updated manually.
Instead, I use `pip-compile` from pip-tools to generate `requirements.txt`.
- Start with an empty `requirements.txt`.
- Install pip-tools with `pip install pip-tools`.
- Compile requirements with `pip-compile` to generate a `requirements.txt` file.
- Install requirements with `pip-sync` (or `pip install -r requirements.txt`); the full sequence is sketched below.
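For reference, the whole initial setup might look like the following in a fresh environment (a sketch; by default `pip-compile` reads `requirements.in` and writes the pinned versions to `requirements.txt`):

```bash
# Install pip-tools in the active environment
pip install pip-tools

# Start from an empty requirements.txt, then pin all dependencies
# (direct and indirect) declared in requirements.in
touch requirements.txt
pip-compile

# Make the environment match the pinned requirements exactly
pip-sync
```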
NB: the advantage of using `pip-sync` over `pip install -r requirements.txt` is that `pip-sync` makes sure the environment matches `requirements.txt`, i.e. it removes packages that are present in the environment but not in `requirements.txt`, if required.
- To upgrade packages, run `pip-compile --upgrade`.
- To add new packages, add them to `requirements.in` and then compile requirements with `pip-compile`. Then, the environment can be updated with `pip-sync`, as shown in the example below.
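For example, adding a new direct dependency (the package name here is only an illustration) and propagating it to the environment could look like:

```bash
# Declare the new direct dependency (pandas is just an example)
echo "pandas" >> requirements.in

# Re-pin requirements.txt and update the environment accordingly
pip-compile
pip-sync

# Alternatively, upgrade all pinned packages to the latest versions
# allowed by requirements.in
pip-compile --upgrade
pip-sync
```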
Run `scripts/setup_venv.sh` to set up a Python virtual environment with venv and install the packages in `requirements.txt`.
By default, the environment is called `.venv` and is created using the default Python interpreter in the current directory.
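Conceptually, the script does something along these lines (a sketch of the assumed behaviour, not the actual script):

```bash
# Approximate behaviour of scripts/setup_venv.sh (assumed, for illustration only)
python -m venv .venv                  # create the virtual environment in the current directory
source .venv/bin/activate             # activate it
pip install -r requirements.txt       # install the pinned requirements
pip install -e .                      # optionally, install the local package in development mode
```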
To set up the environment with conda (assuming it is already installed), navigate to the repository directory and run the following in the command line (specify the Python version and environment name as appropriate):
```bash
$ conda create -n myenv python=3.11
$ conda activate myenv
$ pip install -r requirements.in
$ pip install -e .
```
Then pin the requirements with:
```bash
$ conda env export > environment.yml
```
Finally, the environment can be recreated with:
```bash
$ conda env create -n myenv -f environment.yml
```
A Docker container can be used as a development environment.
In VS Code, this can be achieved using Dev Containers, which are configured in the `.devcontainer` directory.
The environment is automatically built as follows:
- A Docker image of Python is created with packages installed from `requirements.txt` (except local packages). The Python version can be edited in the Dockerfile.
- The image is run in a container and the current directory is mounted.
- The local packages are installed in the container, along with some VS Code extensions.
To set up the dev container:
- Install and launch Docker.
- Open the container by using the command palette in VS Code (`Ctrl + Shift + P`) to search for "Dev Containers: Open Folder in Container...".
If needed, the container can be rebuilt by searching for "Dev Containers: Rebuild Container...".
If `requirements.txt` contains Python packages in private Git repositories, it is easier to install them in the devcontainer post-creation step, since the Git credentials used in VS Code are shared with the devcontainer (alternatively, credentials have to be made available in the Dockerfile).
One way to achieve this is to exclude git packages from being installed in the Docker image and to update the devcontainer post-creation step to install these packages, similarly to how local packages are excluded.
For example, in the Dockerfile:
```dockerfile
RUN grep -vE '(^-e|@ ?git ?+)' /tmp/pip-tmp/requirements.txt | pip --no-cache-dir install -r /dev/stdin
```
And in `devcontainer.json`:

```json
"postCreateCommand": "grep -E '(^-e|@ ?git ?+)' requirements.txt | pip install -r /dev/stdin"
```
Pre-commit hooks are configured using the pre-commit tool.
When this repository is first initialised, the hooks need to be installed with `pre-commit install`.
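For reference, a typical first-time setup looks like this (assuming the hook configuration lives in `.pre-commit-config.yaml`, the file pre-commit reads by default):

```bash
pip install pre-commit       # if the pre-commit tool is not already installed
pre-commit install           # install the git hooks for this repository
pre-commit run --all-files   # optionally, run all hooks once on the whole codebase
```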
This section can be deleted when using the template.
- Initialise your GitHub repository with this template. Alternatively, fork (or copy the content of) this repository.
- Update
  - project information in `pyproject.toml`, such as the description and the authors.
  - the repository name (if the template was forked).
  - the README (title, badges, sections).
  - the license.
- Set up your preferred development environment, notably specifying the Python version.
- Add a git tag for the initial version with `git tag -a v0.1.0 -m "Initial setup"`, and push it with `git push origin --tags`.
I usually work with Visual Studio Code, for which various settings are already predefined. In particular, I use the following extensions for Python development:
- Black for formatting.
- Flake8 and SonarLint for linting.
- autoDocstring to generate docstring skeletons following the Google docstring format.
The `src/` package could contain the following modules or sub-packages, depending on the project:

- `utils` for utility functions.
- `data_processing` for data processing functions (this could be imported as `dp`).
- `features` for extracting features.
- `models` for defining models.
- `evaluation` for evaluating performance.
- `plots` for plotting functions.
The repository structure could be extended with:
- subfolders in `data/`, such as `data/raw/` for storing raw data.
- `models/` to store model files.
Finally, full project documentation (beyond the API) could be generated using mkdocs or quartodoc.
This template is inspired by the concept of a research compendium and similar projects I created for R projects (e.g. reproducible-workflow).
This template is relatively simple and tailored to my needs. More sophisticated templates are available elsewhere, such as:
- Cookiecutter Data Science.
- https://joserzapata.github.io/data-science-project-template/
- Data Science for Social Good's hitchhikers guide template
- https://github.com/khuyentran1401/data-science-template
As opposed to other templates, this one focuses more on experimentation than on sharing a single final product.