Python analysis template

Code style: black · pre-commit · License: MIT

This repository serves as a personal template for data science projects.

File structure

  • Analysis scripts and notebooks are located in analysis/.
  • Reusable functions and modules are stored in the local package src/.
    • The package can then be installed in development mode with pip install -e . for easy prototyping (see the example after this list).
    • src/config.py is used to store variables, constants and configurations.
    • The package version is extracted from git tags using setuptools_scm following semantic versioning.
  • Tests for functions in src/ should go in tests/ and follow the naming convention test_*.py.
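
For example, the local package can be installed and its resolved version checked as follows (assuming the package is registered as src in pyproject.toml; adjust the name otherwise):

$ pip install -e .
$ python -c "from importlib.metadata import version; print(version('src'))"

The second command prints the version derived by setuptools_scm from the latest git tag.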

Moreover, I use the following directories, which are (usually) ignored by Git:

  • data/ to store data files.
  • docs/ to store API documentation generated with pdoc by running scripts/build_docs.sh (as shown below).
  • results/ to store results/output files such as figures, output data, etc.
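
For example, to (re)build the API documentation (from the repository root):

$ bash scripts/build_docs.sh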

Development environment

I set up the environment differently depending on the project. Irrelevant sections can be deleted when using the template.

Requirements

The following does not apply when managing requirements with conda; see the section below.

The requirements are specified in the following files:

  • requirements.in to specify direct dependencies.
  • requirements.txt to pin the dependencies (direct and indirect). This is the file used to recreate the environment from scratch using pip install -r requirements.txt.
  • pyproject.toml to store the direct dependencies of the src package.

The requirements.txt file should not be updated manually. Instead, I use pip-compile from pip-tools to generate requirements.txt.
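
To illustrate how the two files relate, here is a hypothetical session (the package name and pinned versions are purely illustrative, and the generated header is abridged):

$ cat requirements.in
pandas
$ pip-compile
$ cat requirements.txt
# This file is autogenerated by pip-compile
numpy==1.26.4
    # via pandas
pandas==2.2.2
    # via -r requirements.in
# (other pinned dependencies omitted)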

Initial setup

  1. Start with an empty requirements.txt.
  2. Install pip-tools with pip install pip-tools.
  3. Compile requirements with pip-compile to generate a requirements.txt file.
  4. Install requirements with pip-sync (or pip install -r requirements.txt).

NB: the advantage of pip-sync over pip install -r requirements.txt is that pip-sync makes sure the environment exactly matches requirements.txt, removing packages that are installed but no longer listed, if required.
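
In a fresh environment, the initial setup therefore boils down to:

$ pip install pip-tools
$ pip-compile    # generates requirements.txt from requirements.in
$ pip-sync       # installs the pinned requirements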

Update the environment

  • To upgrade packages, run pip-compile --upgrade.
  • To add new packages, add them to requirements.in and then compile the requirements with pip-compile.

Then, the environment can be updated with pip-sync.
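
For example, to upgrade all pinned packages and synchronise the environment:

$ pip-compile --upgrade
$ pip-sync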

venv setup

Run scripts/setup_venv.sh to set up a Python virtual environment with venv and install the packages in requirements.txt. By default, the environment is called .venv and is created in the current directory using the default Python interpreter.
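
For example (activation shown for Linux/macOS):

$ bash scripts/setup_venv.sh
$ source .venv/bin/activate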

Conda setup

To set up the environment with conda (assuming it is already installed), navigate to the repository directory and run the following in the command line (specify the Python version and environment name as appropriate):

$ conda create -n myenv python=3.11
$ conda activate myenv
$ pip install -r requirements.in
$ pip install -e .

Then pin the requirements with:

$ conda env export > environment.yml

Finally, the environment can be recreated with:

$ conda env create -n myenv -f environment.yml

VS Code Dev Containers (Docker)

A Docker container can be used as a development environment. In VS Code, this can be achieved using Dev Containers, which are configured in the .devcontainer directory. The environment is automatically built as follows:

  1. A Docker image of Python is created with packages installed from requirements.txt (except local packages). The Python version can be changed in the Dockerfile.
  2. The image is run in a container and the current directory is mounted.
  3. The local packages are installed in the container, along with some VS Code extensions.

To set up the dev container:

  1. Install and launch Docker.
  2. Open the container by using the command palette in VS Code (Ctrl + Shift + P) to search for "Dev Containers: Open Folder in Container...".

If needed, the container can be rebuilt by searching for "Dev Containers: Rebuild Container...".

Private Git packages

If requirements.txt contains Python packages hosted in private Git repositories, it is easier to install them in the dev container post-creation step, since the Git credentials used in VS Code are shared with the dev container (alternatively, credentials would have to be made available in the Dockerfile).

One way to achieve this is to exclude Git packages from being installed in the Docker image and to update the dev container post-creation step to install these packages, similarly to how local packages are excluded.

For example, in the Dockerfile:

RUN grep -vE '(^-e|@ ?git\+)' /tmp/pip-tmp/requirements.txt | pip --no-cache-dir install -r /dev/stdin

And in devcontainer.json:

"postCreateCommand": "grep -E '(^-e|@ ?git ?+)' requirements.txt | pip install -r /dev/stdin"

Setup Git pre-commit hooks

Pre-commit hooks are configured using the pre-commit tool. When this repository is first initialised, the hooks need to be installed with pre-commit install.
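
For example (assuming pre-commit is not already available in the environment):

$ pip install pre-commit
$ pre-commit install
$ pre-commit run --all-files    # optional: run all hooks once on the whole repository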

Using the template

This section can be deleted when using the template.

Getting started

  1. Initialise your GitHub repository with this template. Alternatively, fork (or copy the content of) this repository.
  2. Update
    • project information in pyproject.toml, such as the description and the authors.
    • the repository name (if the template was forked).
    • the README (title, badges, sections).
    • the license.
  3. Set up your preferred development environment, notably specifying the Python version.
  4. Add a git tag for the initial version with git tag -a v0.1.0 -m "Initial setup", and push it with git push origin --tags.

VS Code

I usually work with Visual Studio Code, for which various settings are already predefined, along with recommended extensions for Python development.

Possible extensions

The src/ package could contain the following modules or sub-packages depending on the project (see the scaffolding sketch after this list):

  • utils for utility functions.
  • data_processing for data processing functions (this could be imported as dp).
  • features for extracting features.
  • models for defining models.
  • evaluation for evaluating performance.
  • plots for plotting functions.
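
As a hypothetical starting point, the corresponding module files could be scaffolded with bash brace expansion:

$ touch src/{utils,data_processing,features,models,evaluation,plots}.py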

The repository structure could be extended with:

  • subfolders in data/ such as data/raw/ for storing raw data.
  • models/ to store model files.

Finally, a full project documentation (beyond the API) could be generated using mkdocs or quartodoc.

Related

This template is inspired by the concept of a research compendium and by similar templates I created for R projects (e.g. reproducible-workflow).

This template is relatively simple and tailored to my needs; more sophisticated templates are available elsewhere. As opposed to these, this template focuses on experimentation rather than on sharing a single final product.