Finish migration of vars_funs module for Python package #32
Merged: jeancochrane merged 62 commits into master from jeancochrane/further-python-package-migration on Dec 4, 2024
Commits (all by jeancochrane):
0231af4 Add basic Python project with just vars_rename
a738b8b Add unit tests for vars_rename
adb87a6 Add pytest-coverage workflow
c0739e8 Add Development docs to python/README.md
6d63b51 Clean up docs in vars_funs.py
f3f38e1 Fix typo in pytest-coverage workflow
abdeca0 Accept any python >=3.9 in python package
ec0a63e Use optional-dependencies for dev deps in pyproject.toml
08e418a Fix vars_rename docstring in Python package
001b079 Update typing in vars_funs.py to be compatible with Python 3.9
2aae031 Add Sphinx docs for Python package
4513bcb Update actions/checkout versions across workflows
44fd662 Add Python docs generation to docs workflow
4070f6f Fix ruff linter errors
854aefc Install both test and docs requirements when running pytest
8afc1d4 Fix paths in pytest-coverage workflow
2549daf Better path management in docs conf.py
723834e Rename build jobs in docs workflow
c323370 Include csv files in package data when building Python package
9cc256b Temporarily disable branch restriction for docs deployment to test it…
eb9a619 Update deploy-pages version
bbbaf68 Revert "Temporarily disable branch restriction for docs deployment to…
b0055b4 Fix broken link in Python docs
be12f3e Switch to new style python type hints since we don't support 3.9 anyway
bd6835d Remove unnecessary templates_path config from pyproject.toml
5eb1e23 Empty commit to try to bust build-pkgdown-site actions cache
1f7290e Draft Python version of vars_recode
1b6bbef Remove unnecessary .python-version file
bca6864 Add pip install directions to README and index.rst for Python package
5960235 Remove unnecessary uv.lock file
aaebf7c Rename 'test' -> 'dev' in pyproject optional-dependencies
cd5bc3e Switch order of authors in pyproject.toml
c8f4e49 Capitalize VAR_NAME_PREFIX constant in vars_funs.py
81d8c20 Remove unnecessary OutputType enum from vars_funs.py
b837865 Remove duplicative type checking in vars_funs.py
848db78 Wrap Python tests in classes for clearer organization
e66eff3 Change chars_sample fixtures to symlinks to R data in Python package
442cf51 Merge ccao Python package into jeancochrane/further-python-package-mi…
008966f WIP add vars_recode
8caf3be Merge 'master' into branch 'jeancochrane/further-python-package-migra…
d132d95 Add tests for vars_recode and fixup logic
204cccf Add docs for vars_dict and vars_recode in Python package
31dffc3 Remove unnecessary select_dtypes filter in Python vars_recode
a8c3233 Add python/ subdir to RBuildignore so it does not get built into R pa…
7505970 Support Python 3.9, pandas 1.4, and numpy 1.23
df3af62 Try installing pandas/numpy before the other dependencies in pytest-c…
e34b6d7 Try building and testing Python package with tox
5637158 Add UV_CACHE_DIR to tox env to see if it speeds up builds
0e29e6f Revert "Add UV_CACHE_DIR to tox env to see if it speeds up builds"
b410f6e Restrict tox envs since 3.11 seems to need to build a dep from source
d135b9a Update docs to fix incorrect EXT_WALL code translation
2a82c96 Clarify docs for vars_dict data object in reference.rst
f5ee577 Stricter dictionary schema validation in Python version of vars_recode
67ea0bb Remove outdated comment in python/ccao/vars_funs.py
b9f300c Fix wheel caching on CI when using uv in Python package
815b6a8 Speed up Python install with uv in docs.yaml
0890d98 Pass env vars to tox defensively
ca15900 Merge pull request #33 from ccao-data/jeancochrane/make-python-packag…
dcf038f Remove UV_SYSTEM_PYTHON env var from docs workflow
c2ab1ff Add `shell: bash` config to `Build Python docs` step of docs workflow
dd2d922 Add tmate to docs workflow for debugging
c771664 Run sphinx-build from the correct working directory in docs workflow
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
^man-roxygen$ | ||
^pkgdown$ | ||
^public$ | ||
^python | ||
^renv$ | ||
^renv\.lock$ | ||
^vignettes$ | ||
|
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
on: | ||
pull_request: | ||
push: | ||
branches: [main, master] | ||
|
||
name: python-build-and-test | ||
|
||
env: | ||
PYTHONUNBUFFERED: "1" | ||
|
||
jobs: | ||
build-and-test: | ||
runs-on: ubuntu-latest | ||
strategy: | ||
fail-fast: false | ||
matrix: | ||
python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"] | ||
|
||
steps: | ||
- name: Checkout code | ||
uses: actions/checkout@v4 | ||
|
||
- name: Install uv | ||
uses: astral-sh/setup-uv@v4 | ||
with: | ||
enable-cache: true | ||
cache-dependency-glob: python/pyproject.toml | ||
cache-suffix: ${{ matrix.python-version }}-test | ||
|
||
- name: Install Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@v5 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
|
||
- name: Install tox | ||
shell: bash | ||
run: | | ||
uv tool install tox --with tox-uv,tox-gh-actions | ||
tox --version | ||
|
||
- name: Build and test with tox | ||
shell: bash | ||
working-directory: python | ||
run: tox r |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
from ccao.vars_funs import vars_dict, vars_rename | ||
from ccao.vars_funs import vars_dict, vars_recode, vars_rename |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,26 @@ | ||
# Functions for translating variables between different data sources | ||
import importlib.resources | ||
import typing | ||
|
||
import pandas as pd | ||
|
||
import ccao.data | ||
|
||
# Load the default variable dictionary | ||
_data_path = importlib.resources.files(ccao.data) | ||
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv")) | ||
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"), dtype=str) | ||
|
||
# Prefix we use to identify variable name columns in the variable dictionary | ||
VAR_NAME_PREFIX = "var_name" | ||
|
||
|
||
def vars_rename( | ||
data: list[str] | pd.DataFrame, | ||
data: typing.Union[typing.List[str], pd.DataFrame], | ||
names_from: str, | ||
names_to: str, | ||
output_type: str = "inplace", | ||
dictionary: pd.DataFrame | None = None, | ||
) -> list[str] | pd.DataFrame: | ||
dictionary: typing.Optional[pd.DataFrame] = None, | ||
) -> typing.Union[typing.List[str], pd.DataFrame]: | ||
""" | ||
Rename variables from one naming convention to another. | ||
|
||
|
@@ -126,3 +127,165 @@ def vars_rename( | |
# If the input data is a list, it's not possible to update it inplace, | ||
# so ignore that argument | ||
return [mapping.get(col, col) for col in data] | ||
|
||
|
||
def vars_recode( | ||
data: pd.DataFrame, | ||
cols: typing.Optional[typing.List[str]] = None, | ||
code_type: str = "long", | ||
as_factor: bool = True, | ||
dictionary: typing.Optional[pd.DataFrame] = None, | ||
) -> pd.DataFrame: | ||
""" | ||
Replace numerically coded variables with human-readable values. | ||
|
||
The system of record stores characteristic values in a numerically encoded | ||
format. This function can be used to translate those values into a | ||
human-readable format. For example, EXT_WALL = 2 will become | ||
EXT_WALL = "Masonry". Note that the values and their translations | ||
must be specified via a user-defined dictionary. The default dictionary is | ||
:data:`vars_dict`. | ||
|
||
Options for ``code_type`` are: | ||
|
||
- ``"long"``, which transforms EXT_WALL = 1 to EXT_WALL = Frame | ||
- ``"short"``, which transforms EXT_WALL = 1 to EXT_WALL = FRME | ||
- ``"code"``, which keeps the original values (useful for removing | ||
improperly coded values, see the note below) | ||
|
||
:param data: | ||
A pandas DataFrame with columns to have values replaced. | ||
:type data: pandas.DataFrame | ||
|
||
:param cols: | ||
A list of column names to be transformed, or ``None`` to select all columns. | ||
:type cols: list[str] | ||
|
||
:param code_type: | ||
The recoding type. See description above for options. | ||
:type code_type: str | ||
|
||
:param as_factor: | ||
If True, re-encoded values will be returned as categorical variables | ||
(pandas Categorical). | ||
If False, re-encoded values will be returned as plain strings. | ||
:type as_factor: bool | ||
|
||
:param dictionary: | ||
A pandas DataFrame representing the dictionary used to translate | ||
encodings. | ||
:type dictionary: pandas.DataFrame | ||
|
||
:raises ValueError: | ||
If the dictionary is missing required columns or if invalid input is | ||
provided. | ||
|
||
:return: | ||
The input DataFrame with re-encoded values for the specified columns. | ||
:rtype: pandas.DataFrame | ||
|
||
.. note:: | ||
Values which are in the data but are NOT in the dictionary will be | ||
converted to NaN. | ||
|
||
:example: | ||
|
||
.. code-block:: python | ||
|
||
import ccao | ||
|
||
sample_data = ccao.sample_athena | ||
|
||
# Defaults to `long` code type | ||
ccao.vars_recode(data=sample_data) | ||
|
||
# Recode to `short` code type | ||
ccao.vars_recode(data=sample_data, code_type="short") | ||
|
||
# Recode only specified columns | ||
ccao.vars_recode(data=sample_data, cols=["GAR1_SIZE"]) | |
""" | ||
# Validate the dictionary schema | ||
dictionary = dictionary if dictionary is not None else vars_dict | ||
if dictionary.empty: | ||
raise ValueError("dictionary must be a non-empty pandas DataFrame") | ||
|
||
required_columns = { | ||
"var_code", | ||
"var_value", | ||
"var_value_short", | ||
"var_type", | ||
"var_data_type", | ||
} | ||
if not required_columns.issubset(dictionary.columns): | ||
raise ValueError( | |
"Input dictionary must contain the following columns: " | |
f"{', '.join(sorted(required_columns))}" | |
) | ||
|
||
if not any(col.startswith("var_name_") for col in dictionary.columns): | ||
raise ValueError( | ||
"Input dictionary must contain at least one var_name_ column" | ||
) | ||
|
||
if code_type not in ["short", "long", "code"]: | ||
raise ValueError("code_type must be one of 'short', 'long', or 'code'") | ||
|
||
# Filter the dictionary for categoricals only and pivot it longer for | |
# easier lookup | |
dict_long = dictionary[ | ||
(dictionary["var_type"] == "char") | ||
& (dictionary["var_data_type"] == "categorical") | ||
] | ||
dict_long = dict_long.melt( | ||
id_vars=["var_code", "var_value", "var_value_short"], | ||
value_vars=[ | ||
col for col in dictionary.columns if col.startswith("var_name_") | ||
], | ||
value_name="var_name", | ||
var_name="var_type", | ||
) | ||
dict_long_pkey = ["var_code", "var_value", "var_value_short", "var_name"] | ||
dict_long = dict_long[dict_long_pkey] | ||
dict_long = dict_long.drop_duplicates(subset=dict_long_pkey) | ||
|
||
# Map the code type to its internal representation in the dictionary | ||
values_to = { | ||
"code": "var_code", | ||
"long": "var_value", | ||
"short": "var_value_short", | ||
}[code_type] | ||
|
||
# Function to apply to each column to remap column values based on the | ||
# vars dict | ||
def transform_column( | ||
col: pd.Series, var_name: str, values_to: str, as_factor: bool | ||
) -> typing.Union[pd.Series, pd.Categorical]: | ||
if var_name in dict_long["var_name"].values: | ||
var_rows = dict_long[dict_long["var_name"] == var_name] | ||
# Get a dictionary mapping the possible codes to their values. | ||
# Use `var_code` as the index (keys) for the dictionary, unless | ||
# we're selecting `var_code`, in which case we can't set it as the | ||
# index and use it for values | ||
var_dict = ( | ||
{code: code for code in var_rows["var_code"].tolist()} | ||
if values_to == "var_code" | ||
else var_rows.copy().set_index("var_code")[values_to].to_dict() | ||
) | ||
if as_factor: | ||
return pd.Categorical( | ||
col.map(var_dict), categories=list(var_dict.values()) | ||
) | ||
Comment on lines +276 to +278: This seemed like the closest analog to the
||
else: | ||
return col.map(var_dict) | ||
return col | ||
|
||
# Recode specified columns, or all columns if none were specified | ||
cols = cols or data.columns | ||
for var_name in cols: | ||
if var_name in data.columns: | ||
data[var_name] = transform_column( | ||
data[var_name], var_name, values_to, as_factor | ||
) | ||
|
||
return data |
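The lookup that `transform_column` builds can be tried in isolation. The following is a minimal sketch, not the package's own test data: the dictionary is a hypothetical two-row stand-in, the column names `var_name_a` and `var_name_b` are invented for illustration, and the short code `MSNR` is made up (only `FRME` appears in the docstring above).

```python
import pandas as pd

# Hypothetical two-row dictionary. "var_name_a"/"var_name_b" and the short
# code "MSNR" are illustrative assumptions, not values from vars_dict.csv.
toy_dict = pd.DataFrame(
    {
        "var_name_a": ["EXT_WALL", "EXT_WALL"],
        "var_name_b": ["ext_wall", "ext_wall"],
        "var_code": ["1", "2"],
        "var_value": ["Frame", "Masonry"],
        "var_value_short": ["FRME", "MSNR"],
    }
)

# Pivot longer so every naming convention points at the same code/value rows
dict_long = toy_dict.melt(
    id_vars=["var_code", "var_value", "var_value_short"],
    value_vars=["var_name_a", "var_name_b"],
    value_name="var_name",
    var_name="var_type",
)

# Build the code -> long value lookup for a single column name
var_rows = dict_long[dict_long["var_name"] == "ext_wall"]
lookup = var_rows.set_index("var_code")["var_value"].to_dict()

# "9" is not in the dictionary, so it maps to a missing value
codes = pd.Series(["1", "2", "9"])
recoded = pd.Categorical(codes.map(lookup), categories=list(lookup.values()))
print(recoded)
```

Passing `categories` explicitly keeps every dictionary value as a category level even when it does not occur in the data, which appears to be the intent of the `as_factor` branch.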
Specifying a `str` dtype is necessary here, since otherwise code columns get interpreted as floats by default. We could restrict this type inference to only the `var_code` column, but it seems like everything in this dict should be a string anyway, so I figure we may as well make the type explicit for all columns.
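A small sketch of the inference behavior described in this comment. It assumes the float inference comes from blank cells in the code column, which is one common cause and is not confirmed for `vars_dict.csv`. The CSV text, the `var_name_example` column, and the `BLDG_SF` row are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV with one blank code, standing in for vars_dict.csv
csv_text = (
    "var_name_example,var_code,var_value\n"
    "EXT_WALL,1,Frame\n"
    "EXT_WALL,2,Masonry\n"
    "BLDG_SF,,\n"
)

inferred = pd.read_csv(io.StringIO(csv_text))
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# The blank cell forces the inferred column to a float dtype (1.0, 2.0, NaN)
print(inferred["var_code"].dtype)

# With dtype=str the codes stay as the strings "1" and "2"
print(as_str["var_code"].tolist())
```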
issue (non-blocking): Won't this prevent matching to numeric values in the original data? i.e. `1` instead of `"1"`.
Theoretically yes, but I think that we ingest all codes as strings when pulling from ias, right? Note that this assumption is currently baked into the R package as well:
ccao/data-raw/vars_dict.R
Lines 1 to 4 in 5be78a6
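The concern raised in this thread can be reproduced with a few lines of pandas. This is a sketch under stated assumptions: the lookup is a hypothetical stand-in for one variable's entries in the dictionary, and the `astype(str)` cast is an illustrative workaround that is not part of this PR.

```python
import pandas as pd

# Lookup keyed by string codes, as it would be after reading with dtype=str
lookup = {"1": "Frame", "2": "Masonry"}

numeric_codes = pd.Series([1, 2, 1])
string_codes = pd.Series(["1", "2", "1"])

# Integer values never equal the string keys, so every row becomes NaN
unmatched = numeric_codes.map(lookup)

# String values match the keys directly
matched = string_codes.map(lookup)

# Hypothetical workaround (not in the PR): cast to str before mapping
cast_then_matched = numeric_codes.astype(str).map(lookup)

print(unmatched.isna().all())
print(matched.tolist())
print(cast_then_matched.tolist())
```

If codes are always ingested as strings, as the reply suggests, the cast is unnecessary.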