Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish migration of vars_funs module for Python package #32

Merged
merged 62 commits into from
Dec 4, 2024
Merged
Show file tree
Hide file tree
Changes from 43 commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
0231af4
Add basic Python project with just vars_rename
jeancochrane Nov 20, 2024
a738b8b
Add unit tests for vars_rename
jeancochrane Nov 20, 2024
adb87a6
Add pytest-coverage workflow
jeancochrane Nov 20, 2024
c0739e8
Add Development docs to python/README.md
jeancochrane Nov 20, 2024
6d63b51
Clean up docs in vars_funs.py
jeancochrane Nov 20, 2024
f3f38e1
Fix typo in pytest-coverage workflow
jeancochrane Nov 20, 2024
abdeca0
Accept any python >=3.9 in python package
jeancochrane Nov 20, 2024
ec0a63e
Use optional-dependencies for dev deps in pyproject.toml
jeancochrane Nov 20, 2024
08e418a
Fix vars_rename docstring in Python package
jeancochrane Nov 20, 2024
001b079
Update typing in vars_funs.py to be compatible with Python 3.9
jeancochrane Nov 20, 2024
2aae031
Add Sphinx docs for Python package
jeancochrane Nov 21, 2024
4513bcb
Update actions/checkout versions across workflows
jeancochrane Nov 21, 2024
44fd662
Add Python docs generation to docs workflow
jeancochrane Nov 21, 2024
4070f6f
Fix ruff linter errors
jeancochrane Nov 21, 2024
854aefc
Install both test and docs requirements when running pytest
jeancochrane Nov 21, 2024
8afc1d4
Fix paths in pytest-coverage workflow
jeancochrane Nov 21, 2024
2549daf
Better path management in docs conf.py
jeancochrane Nov 21, 2024
723834e
Rename build jobs in docs workflow
jeancochrane Nov 21, 2024
c323370
Include csv files in package data when building Python package
jeancochrane Nov 21, 2024
9cc256b
Temporarily disable branch restriction for docs deployment to test it…
jeancochrane Nov 21, 2024
eb9a619
Update deploy-pages version
jeancochrane Nov 21, 2024
bbbaf68
Revert "Temporarily disable branch restriction for docs deployment to…
jeancochrane Nov 21, 2024
b0055b4
Fix broken link in Python docs
jeancochrane Nov 21, 2024
be12f3e
Switch to new style python type hints since we don't support 3.9 anyway
jeancochrane Nov 21, 2024
bd6835d
Remove unnecessary templates_path config from pyproject.toml
jeancochrane Nov 21, 2024
5eb1e23
Empty commit to try to bust build-pkgdown-site actions cache
jeancochrane Nov 21, 2024
1f7290e
Draft Python version of vars_recode
jeancochrane Nov 22, 2024
1b6bbef
Remove unnecessary .python-version file
jeancochrane Nov 25, 2024
bca6864
Add pip install directions to README and index.rst for Python package
jeancochrane Nov 25, 2024
5960235
Remove unnecessary uv.lock file
jeancochrane Nov 25, 2024
aaebf7c
Rename 'test' -> 'dev' in pyproject optional-dependencies
jeancochrane Nov 25, 2024
cd5bc3e
Switch order of authors in pyproject.toml
jeancochrane Nov 25, 2024
c8f4e49
Capitalize VAR_NAME_PREFIX constant in vars_funs.py
jeancochrane Nov 25, 2024
81d8c20
Remove unnecessary OutputType enum from vars_funs.py
jeancochrane Nov 25, 2024
b837865
Remove duplicative type checking in vars_funs.py
jeancochrane Nov 25, 2024
848db78
Wrap Python tests in classes for clearer organization
jeancochrane Nov 25, 2024
e66eff3
Change chars_sample fixtures to symlinks to R data in Python package
jeancochrane Nov 25, 2024
442cf51
Merge ccao Python package into jeancochrane/further-python-package-mi…
jeancochrane Nov 25, 2024
008966f
WIP add vars_recode
jeancochrane Nov 25, 2024
8caf3be
Merge 'master' into branch 'jeancochrane/further-python-package-migra…
jeancochrane Nov 26, 2024
d132d95
Add tests for vars_recode and fixup logic
jeancochrane Nov 26, 2024
204cccf
Add docs for vars_dict and vars_recode in Python package
jeancochrane Nov 26, 2024
31dffc3
Remove unnecessary select_dtypes filter in Python vars_recode
jeancochrane Nov 26, 2024
a8c3233
Add python/ subdir to RBuildignore so it does not get built into R pa…
jeancochrane Nov 27, 2024
7505970
Support Python 3.9, pandas 1.4, and numpy 1.23
jeancochrane Nov 27, 2024
df3af62
Try installing pandas/numpy before the other dependencies in pytest-c…
jeancochrane Nov 27, 2024
e34b6d7
Try building and testing Python package with tox
jeancochrane Dec 2, 2024
5637158
Add UV_CACHE_DIR to tox env to see if it speeds up builds
jeancochrane Dec 2, 2024
0e29e6f
Revert "Add UV_CACHE_DIR to tox env to see if it speeds up builds"
jeancochrane Dec 2, 2024
b410f6e
Restrict tox envs since 3.11 seems to need to build a dep from source
jeancochrane Dec 2, 2024
d135b9a
Update docs to fix incorrect EXT_WALL code translation
jeancochrane Dec 2, 2024
2a82c96
Clarify docs for vars_dict data object in reference.rst
jeancochrane Dec 2, 2024
f5ee577
Stricter dictionary schema validation in Python version of vars_recode
jeancochrane Dec 2, 2024
67ea0bb
Remove outdated comment in python/ccao/vars_funs.py
jeancochrane Dec 2, 2024
b9f300c
Fix wheel caching on CI when using uv in Python package
jeancochrane Dec 2, 2024
815b6a8
Speed up Python install with uv in docs.yaml
jeancochrane Dec 2, 2024
0890d98
Pass env vars to tox defensively
jeancochrane Dec 2, 2024
ca15900
Merge pull request #33 from ccao-data/jeancochrane/make-python-packag…
jeancochrane Dec 4, 2024
dcf038f
Remove UV_SYSTEM_PYTHON env var from docs workflow
jeancochrane Dec 4, 2024
c2ab1ff
Add `shell: bash` config to `Build Python docs` step of docs workflow
jeancochrane Dec 4, 2024
dd2d922
Add tmate to docs workflow for debugging
jeancochrane Dec 4, 2024
c771664
Run sphinx-build from the correct working directory in docs workflow
jeancochrane Dec 4, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion python/ccao/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
from ccao.vars_funs import vars_dict, vars_rename
from ccao.vars_funs import vars_dict, vars_recode, vars_rename
154 changes: 153 additions & 1 deletion python/ccao/vars_funs.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

# Load the default variable dictionary
_data_path = importlib.resources.files(ccao.data)
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"))
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"), dtype=str)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying a str dtype is necessary here, since otherwise code columns get interpreted as floats by default. We could restrict this type inference to only the var_code column, but it seems like everything in this dict should be a string anyway, so I figure we may as well make the type explicit for all columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking): Won't this prevent matching to numeric values in the original data? i.e. 1 instead of "1".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically yes, but I think that we ingest all codes as strings when pulling from ias, right? Note that this assumption is currently baked into the R package as well:

vars_dict <- readr::read_csv(
file = "data-raw/vars_dict.csv",
col_types = readr::cols(var_code = readr::col_character())
)


# Prefix we use to identify variable name columns in the variable dictionary
VAR_NAME_PREFIX = "var_name"
Expand Down Expand Up @@ -126,3 +126,155 @@ def vars_rename(
# If the input data is a list, it's not possible to update it inplace,
# so ignore that argument
return [mapping.get(col, col) for col in data]


def vars_recode(
data: pd.DataFrame,
cols: list[str] | None = None,
code_type: str = "long",
as_factor: bool = True,
dictionary: pd.DataFrame | None = None,
) -> pd.DataFrame:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied over this interface and its docs pretty much as-is from the R package, just renaming the type and dict params to reflect the fact that those are reserved keywords in Python:

ccao/R/vars_funs.R

Lines 282 to 286 in 5be78a6

vars_recode <- function(data,
cols = dplyr::everything(),
type = "long",
as_factor = TRUE,
dict = ccao::vars_dict) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to keep the interfaces across the R and Python versions of our major packages the same. Maybe we can rename the R function inputs here and then release a major version as we did with assesspy. While we're at it we can trim out some of the unused functionality of the R version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll take a stab at doing this in a fast follow PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up PR here: #34

"""
Replace numerically coded variables with human-readable values.

The system of record stores characteristic values in a numerically encoded
format. This function can be used to translate those values into a
human-readable format. For example, EXT_WALL = 2 will become
EXT_WALL = "Frame + Masonry". Note that the values and their translations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: It's wrong in the R docs too, but EXT_WALL = 2 is "Masonry", not "Frame + Masonry".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks! Fixed in d135b9a.

must be specified via a user-defined dictionary. The default dictionary is
:data:`vars_dict`.

Options for ``code_type`` are:

- ``"long"``, which transforms EXT_WALL = 1 to EXT_WALL = Frame
- ``"short"``, which transforms EXT_WALL = 1 to EXT_WALL = FRME
- ``"code"``, which keeps the original values (useful for removing
improperly coded values, see the note below)

:param data:
A pandas DataFrame with columns to have values replaced.
:type data: pandas.DataFrame

:param cols:
A list of column names to be transformed, or ``None`` to select all columns.
:type cols: list[str]

:param code_type:
The recoding type. See description above for options.
:type code_type: str

:param as_factor:
If True, re-encoded values will be returned as categorical variables
(pandas Categorical).
If False, re-encoded values will be returned as plain strings.
:type as_factor: bool

:param dictionary:
A pandas DataFrame representing the dictionary used to translate
encodings.
:type dictionary: pandas.DataFrame

:raises ValueError:
If the dictionary is missing required columns or if invalid input is
provided.

:return:
The input DataFrame with re-encoded values for the specified columns.
:rtype: pandas.DataFrame

.. note::
Values which are in the data but are NOT in the dictionary will be
converted to NaN.

:example:

.. code-block:: python

import ccao

sample_data = ccao.sample_athena

# Defaults to `long` code type
ccao.vars_recode(data=sample_data)

# Recode to `short` code type
ccao.vars_recode(data=sample_data, code_type="short")

# Recode only specified columns
ccao.vars_recode(data=sample_data, cols="GAR1_SIZE")
"""
# Validate the dictionary schema
dictionary = dictionary if dictionary is not None else vars_dict
if dictionary.empty:
raise ValueError("dictionary must be a non-empty pandas DataFrame")

required_columns = {"var_code", "var_value", "var_value_short"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): var_name, var_type, and var_data_type should be required as well since they're used below.

Edit: Plus all the var_name_ columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I added stricter validation in f5ee577. Note that we check for any column matching the pattern var_name_*, since technically the function doesn't care what the names are, or how many of them there are.

if not required_columns.issubset(dictionary.columns):
raise ValueError(
"Input dictionary must contain the following columns: "
f"{', '.join(required_columns)}"
)

# Validate code type and convert it to the enum
if code_type not in ["short", "long", "code"]:
raise ValueError("code_type must be one of 'short', 'long', or 'code'")
jeancochrane marked this conversation as resolved.
Show resolved Hide resolved

# Filter the dictionary for categoricals only and and pivot it longer for
# easier lookup
dict_long = dictionary[
(dictionary["var_type"] == "char")
& (dictionary["var_data_type"] == "categorical")
]
dict_long = dict_long.melt(
id_vars=["var_code", "var_value", "var_value_short"],
value_vars=[
col for col in dictionary.columns if col.startswith("var_name_")
],
value_name="var_name",
var_name="var_type",
)
dict_long_pkey = ["var_code", "var_value", "var_value_short", "var_name"]
dict_long = dict_long[dict_long_pkey]
dict_long = dict_long.drop_duplicates(subset=dict_long_pkey)

# Map the code type to its internal representation in the dictionary
values_to = {
"code": "var_code",
"long": "var_value",
"short": "var_value_short",
}[code_type]

# Function to apply to each column to remap column values based on the
# vars dict
def transform_column(
col: pd.Series, var_name: str, values_to: str, as_factor: bool
) -> pd.Series | pd.Categorical:
if var_name in dict_long["var_name"].values:
var_rows = dict_long[dict_long["var_name"] == var_name]
# Get a dictionary mapping the possible codes to their values.
# Use `var_code` as the index (keys) for the dictionary, unless
# we're selecting `var_code`, in which case we can't set it as the
# index and use it for values
var_dict = (
{code: code for code in var_rows["var_code"].tolist()}
if values_to == "var_code"
else var_rows.copy().set_index("var_code")[values_to].to_dict()
)
if as_factor:
return pd.Categorical(
col.map(var_dict), categories=list(var_dict.values())
)
Comment on lines +276 to +278
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed like the closest analog to the as_factor behavior in R, but let me know if you have a different preference for how we should handle it.

else:
return col.map(var_dict)
return col

# Recode specified columns, or all columns if none were specified
cols = cols or data.columns
for var_name in cols:
if var_name in data.columns:
data[var_name] = transform_column(
data[var_name], var_name, values_to, as_factor
)

return data
17 changes: 15 additions & 2 deletions python/docs/source/reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,19 @@ Manage characteristics
^^^^^^^^^^^^^^^^^^^^^^

Recode/rename characteristic columns, merge HIE data, and fix characteristic
errors.
errors

:doc:`vars_rename() <vars_rename>`
:doc:`vars_rename() <vars_rename>` |nbsp|
:doc:`vars_recode() <vars_recode>`

Data
----

Dictionaries
^^^^^^^^^^^^

Lookups for numeric codes used in the assessment system
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Also used for vars_rename, so not just for the numeric code lookup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, clarified in 2a82c96.


:doc:`vars_dict <vars_dict>`

.. |nbsp| unicode:: 0xA0
28 changes: 28 additions & 0 deletions python/docs/source/vars_dict.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
================================================
Data dictionary for CCAO data sets and variables
================================================
Comment on lines +1 to +3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use autodata to document this dict based on a docstring in the same way we do it in the R package. However, it seemed to be spitting out the docs for pandas.DataFrame regardless of how I configured it, so I gave up and just documented here.


A crosswalk of CCAO variable names used in iasWorld, AWS, modeling,
and open data. Also includes a translation of numeric character codes
to their human-readable value (ROOF_CNST = 1
becomes ROOF_CNST = Shingle/Asphalt).

Format
------

A pandas DataFrame with the following columns:

- **var_name_hie**: Column name of variable when stored in the legacy ADDCHARS SQL table.
- **var_name_iasworld**: Column name for variable as stored in the system of record (iasWorld).
- **var_name_athena**: Column name used for views and tables in AWS Athena.
- **var_name_model**: Column name used while data is flowing through modeling pipelines.
- **var_name_publish**: Human-readable column name used for public data sets.
- **var_name_pretty**: Human-readable column name used for publication and reporting.
- **var_type**: Variable type/prefix indicating the variable's function. For example,
``ind_`` variables are always indicators (booleans), while ``char_`` variables are
always property characteristics.
- **var_data_type**: R data type variable values should be stored as.
- **var_code**: Factor value for categorical variables. These are the values stored
in the system of record.
- **var_value**: Human-readable translation of factor value.
- **var_value_short**: Human-readable translation of factor value, but as short as possible.
5 changes: 5 additions & 0 deletions python/docs/source/vars_recode.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
==============================================================
Replace numerically coded variables with human-readable values
==============================================================

.. autofunction:: ccao.vars_recode
Loading
Loading