Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish migration of vars_funs module for Python package #32

Merged
merged 62 commits into from
Dec 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
0231af4
Add basic Python project with just vars_rename
jeancochrane Nov 20, 2024
a738b8b
Add unit tests for vars_rename
jeancochrane Nov 20, 2024
adb87a6
Add pytest-coverage workflow
jeancochrane Nov 20, 2024
c0739e8
Add Development docs to python/README.md
jeancochrane Nov 20, 2024
6d63b51
Clean up docs in vars_funs.py
jeancochrane Nov 20, 2024
f3f38e1
Fix typo in pytest-coverage workflow
jeancochrane Nov 20, 2024
abdeca0
Accept any python >=3.9 in python package
jeancochrane Nov 20, 2024
ec0a63e
Use optional-dependencies for dev deps in pyproject.toml
jeancochrane Nov 20, 2024
08e418a
Fix vars_rename docstring in Python package
jeancochrane Nov 20, 2024
001b079
Update typing in vars_funs.py to be compatible with Python 3.9
jeancochrane Nov 20, 2024
2aae031
Add Sphinx docs for Python package
jeancochrane Nov 21, 2024
4513bcb
Update actions/checkout versions across workflows
jeancochrane Nov 21, 2024
44fd662
Add Python docs generation to docs workflow
jeancochrane Nov 21, 2024
4070f6f
Fix ruff linter errors
jeancochrane Nov 21, 2024
854aefc
Install both test and docs requirements when running pytest
jeancochrane Nov 21, 2024
8afc1d4
Fix paths in pytest-coverage workflow
jeancochrane Nov 21, 2024
2549daf
Better path management in docs conf.py
jeancochrane Nov 21, 2024
723834e
Rename build jobs in docs workflow
jeancochrane Nov 21, 2024
c323370
Include csv files in package data when building Python package
jeancochrane Nov 21, 2024
9cc256b
Temporarily disable branch restriction for docs deployment to test it…
jeancochrane Nov 21, 2024
eb9a619
Update deploy-pages version
jeancochrane Nov 21, 2024
bbbaf68
Revert "Temporarily disable branch restriction for docs deployment to…
jeancochrane Nov 21, 2024
b0055b4
Fix broken link in Python docs
jeancochrane Nov 21, 2024
be12f3e
Switch to new style python type hints since we don't support 3.9 anyway
jeancochrane Nov 21, 2024
bd6835d
Remove unnecessary templates_path config from pyproject.toml
jeancochrane Nov 21, 2024
5eb1e23
Empty commit to try to bust build-pkgdown-site actions cache
jeancochrane Nov 21, 2024
1f7290e
Draft Python version of vars_recode
jeancochrane Nov 22, 2024
1b6bbef
Remove unnecessary .python-version file
jeancochrane Nov 25, 2024
bca6864
Add pip install directions to README and index.rst for Python package
jeancochrane Nov 25, 2024
5960235
Remove unnecessary uv.lock file
jeancochrane Nov 25, 2024
aaebf7c
Rename 'test' -> 'dev' in pyproject optional-dependencies
jeancochrane Nov 25, 2024
cd5bc3e
Switch order of authors in pyproject.toml
jeancochrane Nov 25, 2024
c8f4e49
Capitalize VAR_NAME_PREFIX constant in vars_funs.py
jeancochrane Nov 25, 2024
81d8c20
Remove unnecessary OutputType enum from vars_funs.py
jeancochrane Nov 25, 2024
b837865
Remove duplicative type checking in vars_funs.py
jeancochrane Nov 25, 2024
848db78
Wrap Python tests in classes for clearer organization
jeancochrane Nov 25, 2024
e66eff3
Change chars_sample fixtures to symlinks to R data in Python package
jeancochrane Nov 25, 2024
442cf51
Merge ccao Python package into jeancochrane/further-python-package-mi…
jeancochrane Nov 25, 2024
008966f
WIP add vars_recode
jeancochrane Nov 25, 2024
8caf3be
Merge 'master' into branch 'jeancochrane/further-python-package-migra…
jeancochrane Nov 26, 2024
d132d95
Add tests for vars_recode and fixup logic
jeancochrane Nov 26, 2024
204cccf
Add docs for vars_dict and vars_recode in Python package
jeancochrane Nov 26, 2024
31dffc3
Remove unnecessary select_dtypes filter in Python vars_recode
jeancochrane Nov 26, 2024
a8c3233
Add python/ subdir to RBuildignore so it does not get built into R pa…
jeancochrane Nov 27, 2024
7505970
Support Python 3.9, pandas 1.4, and numpy 1.23
jeancochrane Nov 27, 2024
df3af62
Try installing pandas/numpy before the other dependencies in pytest-c…
jeancochrane Nov 27, 2024
e34b6d7
Try building and testing Python package with tox
jeancochrane Dec 2, 2024
5637158
Add UV_CACHE_DIR to tox env to see if it speeds up builds
jeancochrane Dec 2, 2024
0e29e6f
Revert "Add UV_CACHE_DIR to tox env to see if it speeds up builds"
jeancochrane Dec 2, 2024
b410f6e
Restrict tox envs since 3.11 seems to need to build a dep from source
jeancochrane Dec 2, 2024
d135b9a
Update docs to fix incorrect EXT_WALL code translation
jeancochrane Dec 2, 2024
2a82c96
Clarify docs for vars_dict data object in reference.rst
jeancochrane Dec 2, 2024
f5ee577
Stricter dictionary schema validation in Python version of vars_recode
jeancochrane Dec 2, 2024
67ea0bb
Remove outdated comment in python/ccao/vars_funs.py
jeancochrane Dec 2, 2024
b9f300c
Fix wheel caching on CI when using uv in Python package
jeancochrane Dec 2, 2024
815b6a8
Speed up Python install with uv in docs.yaml
jeancochrane Dec 2, 2024
0890d98
Pass env vars to tox defensively
jeancochrane Dec 2, 2024
ca15900
Merge pull request #33 from ccao-data/jeancochrane/make-python-packag…
jeancochrane Dec 4, 2024
dcf038f
Remove UV_SYSTEM_PYTHON env var from docs workflow
jeancochrane Dec 4, 2024
c2ab1ff
Add `shell: bash` config to `Build Python docs` step of docs workflow
jeancochrane Dec 4, 2024
dd2d922
Add tmate to docs workflow for debugging
jeancochrane Dec 4, 2024
c771664
Run sphinx-build from the correct working directory in docs workflow
jeancochrane Dec 4, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
^man-roxygen$
^pkgdown$
^public$
^python
^renv$
^renv\.lock$
^vignettes$
Expand Down
17 changes: 7 additions & 10 deletions .github/workflows/docs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@ name: docs

env:
PYTHONUNBUFFERED: "1"
UV_SYSTEM_PYTHON: 1

jobs:
build:
Expand All @@ -33,16 +32,12 @@ jobs:
needs: website

- name: Install uv
uses: astral-sh/setup-uv@v3
with:
enable-cache: true
cache-dependency-glob: python/pyproject.toml
cache-suffix: docs
uses: astral-sh/setup-uv@v4

- name: Setup Python
uses: actions/setup-python@v5
with:
python-version-file: python/pyproject.toml
shell: bash
working-directory: python
run: uv venv

- name: Install Python dependencies
working-directory: python
Expand All @@ -54,7 +49,9 @@ jobs:
shell: Rscript {0}

- name: Build Python docs
run: sphinx-build python/docs/source docs/python
working-directory: python
run: uv run sphinx-build docs/source ../docs/python
shell: bash

- name: Configure pages
uses: actions/configure-pages@v5
Expand Down
52 changes: 0 additions & 52 deletions .github/workflows/pytest-coverage.yaml

This file was deleted.

44 changes: 44 additions & 0 deletions .github/workflows/python-build-and-test.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
on:
pull_request:
push:
branches: [main, master]

name: python-build-and-test

env:
PYTHONUNBUFFERED: "1"

jobs:
build-and-test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Install uv
uses: astral-sh/setup-uv@v4
with:
enable-cache: true
cache-dependency-glob: python/pyproject.toml
cache-suffix: ${{ matrix.python-version }}-test

- name: Install Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install tox
shell: bash
run: |
uv tool install tox --with tox-uv,tox-gh-actions
tox --version

- name: Build and test with tox
shell: bash
working-directory: python
run: tox r
2 changes: 1 addition & 1 deletion R/vars_funs.R
Original file line number Diff line number Diff line change
Expand Up @@ -200,7 +200,7 @@ vars_rename <- function(data,
#' @description The system of record stores characteristic values in a
#' numerically encoded format. This function can be used to translate those
#' values into a human-readable format. For example, EXT_WALL = 2 will become
#' EXT_WALL = "Frame + Masonry". Note that the values and their translations are
#' EXT_WALL = "Masonry". Note that the values and their translations are
#' must be specified via a user-defined dictionary. The default dictionary is
#' \code{\link{vars_dict}}.
#'
Expand Down
2 changes: 1 addition & 1 deletion python/ccao/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
from ccao.vars_funs import vars_dict, vars_rename
from ccao.vars_funs import vars_dict, vars_recode, vars_rename
171 changes: 167 additions & 4 deletions python/ccao/vars_funs.py
Original file line number Diff line number Diff line change
@@ -1,25 +1,26 @@
# Functions for translating variables between different data sources
import importlib.resources
import typing

import pandas as pd

import ccao.data

# Load the default variable dictionary
_data_path = importlib.resources.files(ccao.data)
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"))
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"), dtype=str)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying a str dtype is necessary here, since otherwise code columns get interpreted as floats by default. We could restrict this type inference to only the var_code column, but it seems like everything in this dict should be a string anyway, so I figure we may as well make the type explicit for all columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking): Won't this prevent matching to numeric values in the original data? i.e. 1 instead of "1".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically yes, but I think that we ingest all codes as strings when pulling from ias, right? Note that this assumption is currently baked into the R package as well:

vars_dict <- readr::read_csv(
file = "data-raw/vars_dict.csv",
col_types = readr::cols(var_code = readr::col_character())
)


# Prefix we use to identify variable name columns in the variable dictionary
VAR_NAME_PREFIX = "var_name"


def vars_rename(
data: list[str] | pd.DataFrame,
data: typing.Union[typing.List[str], pd.DataFrame],
names_from: str,
names_to: str,
output_type: str = "inplace",
dictionary: pd.DataFrame | None = None,
) -> list[str] | pd.DataFrame:
dictionary: typing.Optional[pd.DataFrame] = None,
) -> typing.Union[typing.List[str], pd.DataFrame]:
"""
Rename variables from one naming convention to another.

Expand Down Expand Up @@ -126,3 +127,165 @@ def vars_rename(
# If the input data is a list, it's not possible to update it inplace,
# so ignore that argument
return [mapping.get(col, col) for col in data]


def vars_recode(
data: pd.DataFrame,
cols: typing.Optional[typing.List[str]] = None,
code_type: str = "long",
as_factor: bool = True,
dictionary: typing.Optional[pd.DataFrame] = None,
) -> pd.DataFrame:
"""
Replace numerically coded variables with human-readable values.

The system of record stores characteristic values in a numerically encoded
format. This function can be used to translate those values into a
human-readable format. For example, EXT_WALL = 2 will become
EXT_WALL = "Masonry". Note that the values and their translations
must be specified via a user-defined dictionary. The default dictionary is
:data:`vars_dict`.

Options for ``code_type`` are:

- ``"long"``, which transforms EXT_WALL = 1 to EXT_WALL = Frame
- ``"short"``, which transforms EXT_WALL = 1 to EXT_WALL = FRME
- ``"code"``, which keeps the original values (useful for removing
improperly coded values, see the note below)

:param data:
A pandas DataFrame with columns to have values replaced.
:type data: pandas.DataFrame

:param cols:
A list of column names to be transformed, or ``None`` to select all columns.
:type cols: list[str]

:param code_type:
The recoding type. See description above for options.
:type code_type: str

:param as_factor:
If True, re-encoded values will be returned as categorical variables
(pandas Categorical).
If False, re-encoded values will be returned as plain strings.
:type as_factor: bool

:param dictionary:
A pandas DataFrame representing the dictionary used to translate
encodings.
:type dictionary: pandas.DataFrame

:raises ValueError:
If the dictionary is missing required columns or if invalid input is
provided.

:return:
The input DataFrame with re-encoded values for the specified columns.
:rtype: pandas.DataFrame

.. note::
Values which are in the data but are NOT in the dictionary will be
converted to NaN.

:example:

.. code-block:: python

import ccao

sample_data = ccao.sample_athena

# Defaults to `long` code type
ccao.vars_recode(data=sample_data)

# Recode to `short` code type
ccao.vars_recode(data=sample_data, code_type="short")

# Recode only specified columns
ccao.vars_recode(data=sample_data, cols="GAR1_SIZE")
"""
# Validate the dictionary schema
dictionary = dictionary if dictionary is not None else vars_dict
if dictionary.empty:
raise ValueError("dictionary must be a non-empty pandas DataFrame")

required_columns = {
"var_code",
"var_value",
"var_value_short",
"var_type",
"var_data_type",
}
if not required_columns.issubset(dictionary.columns):
raise ValueError(
"Input dictionary must contain the following columns: "
f"{', '.join(required_columns)}"
)

if not any(col.startswith("var_name_") for col in dictionary.columns):
raise ValueError(
"Input dictionary must contain at least one var_name_ column"
)

if code_type not in ["short", "long", "code"]:
raise ValueError("code_type must be one of 'short', 'long', or 'code'")

# Filter the dictionary for categoricals only and and pivot it longer for
# easier lookup
dict_long = dictionary[
(dictionary["var_type"] == "char")
& (dictionary["var_data_type"] == "categorical")
]
dict_long = dict_long.melt(
id_vars=["var_code", "var_value", "var_value_short"],
value_vars=[
col for col in dictionary.columns if col.startswith("var_name_")
],
value_name="var_name",
var_name="var_type",
)
dict_long_pkey = ["var_code", "var_value", "var_value_short", "var_name"]
dict_long = dict_long[dict_long_pkey]
dict_long = dict_long.drop_duplicates(subset=dict_long_pkey)

# Map the code type to its internal representation in the dictionary
values_to = {
"code": "var_code",
"long": "var_value",
"short": "var_value_short",
}[code_type]

# Function to apply to each column to remap column values based on the
# vars dict
def transform_column(
col: pd.Series, var_name: str, values_to: str, as_factor: bool
) -> typing.Union[pd.Series, pd.Categorical]:
if var_name in dict_long["var_name"].values:
var_rows = dict_long[dict_long["var_name"] == var_name]
# Get a dictionary mapping the possible codes to their values.
# Use `var_code` as the index (keys) for the dictionary, unless
# we're selecting `var_code`, in which case we can't set it as the
# index and use it for values
var_dict = (
{code: code for code in var_rows["var_code"].tolist()}
if values_to == "var_code"
else var_rows.copy().set_index("var_code")[values_to].to_dict()
)
if as_factor:
return pd.Categorical(
col.map(var_dict), categories=list(var_dict.values())
)
Comment on lines +276 to +278
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed like the closest analog to the as_factor behavior in R, but let me know if you have a different preference for how we should handle it.

else:
return col.map(var_dict)
return col

# Recode specified columns, or all columns if none were specified
cols = cols or data.columns
for var_name in cols:
if var_name in data.columns:
data[var_name] = transform_column(
data[var_name], var_name, values_to, as_factor
)

return data
17 changes: 15 additions & 2 deletions python/docs/source/reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,19 @@ Manage characteristics
^^^^^^^^^^^^^^^^^^^^^^

Recode/rename characteristic columns, merge HIE data, and fix characteristic
errors.
errors

:doc:`vars_rename() <vars_rename>`
:doc:`vars_rename() <vars_rename>` |nbsp|
:doc:`vars_recode() <vars_recode>`

Data
----

Dictionaries
^^^^^^^^^^^^

Lookups for numeric codes and variable names used in the assessment system

:doc:`vars_dict <vars_dict>`

.. |nbsp| unicode:: 0xA0
Loading
Loading