Finish migration of vars_funs module for Python package #32
Changes from 48 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
^man-roxygen$ | ||
^pkgdown$ | ||
^public$ | ||
^python | ||
^renv$ | ||
^renv\.lock$ | ||
^vignettes$ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
from ccao.vars_funs import vars_dict, vars_rename | ||
from ccao.vars_funs import vars_dict, vars_recode, vars_rename |
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -7,7 +7,7 @@ | |||||||||||
|
||||||||||||
# Load the default variable dictionary | ||||||||||||
_data_path = importlib.resources.files(ccao.data) | ||||||||||||
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv")) | ||||||||||||
vars_dict = pd.read_csv(str(_data_path / "vars_dict.csv"), dtype=str) | ||||||||||||
|
||||||||||||
# Prefix we use to identify variable name columns in the variable dictionary | ||||||||||||
VAR_NAME_PREFIX = "var_name" | ||||||||||||
|
@@ -126,3 +126,165 @@ def vars_rename( | |||||||||||
# If the input data is a list, it's not possible to update it inplace, | ||||||||||||
# so ignore that argument | ||||||||||||
return [mapping.get(col, col) for col in data] | ||||||||||||
|
||||||||||||
|
||||||||||||
def vars_recode( | ||||||||||||
data: pd.DataFrame, | ||||||||||||
cols: list[str] | None = None, | ||||||||||||
code_type: str = "long", | ||||||||||||
as_factor: bool = True, | ||||||||||||
dictionary: pd.DataFrame | None = None, | ||||||||||||
) -> pd.DataFrame: | ||||||||||||
I copied over this interface and its docs pretty much as-is from the R package, just renaming the Lines 282 to 286 in 5be78a6
I'd really like to keep the interfaces across the R and Python versions of our major packages the same. Maybe we can rename the R function inputs here and then release a major version as we did with
Makes sense, I'll take a stab at doing this in a fast-follow PR.
Follow-up PR here: #34 |
||||||||||||
""" | ||||||||||||
Replace numerically coded variables with human-readable values. | ||||||||||||
|
||||||||||||
The system of record stores characteristic values in a numerically encoded | ||||||||||||
format. This function can be used to translate those values into a | ||||||||||||
human-readable format. For example, EXT_WALL = 2 will become | ||||||||||||
EXT_WALL = "Masonry". Note that the values and their translations | ||||||||||||
must be specified via a user-defined dictionary. The default dictionary is | ||||||||||||
:data:`vars_dict`. | ||||||||||||
|
||||||||||||
Options for ``code_type`` are: | ||||||||||||
|
||||||||||||
- ``"long"``, which transforms EXT_WALL = 1 to EXT_WALL = Frame | ||||||||||||
- ``"short"``, which transforms EXT_WALL = 1 to EXT_WALL = FRME | ||||||||||||
- ``"code"``, which keeps the original values (useful for removing | ||||||||||||
improperly coded values, see the note below) | ||||||||||||
|
||||||||||||
:param data: | ||||||||||||
A pandas DataFrame with columns to have values replaced. | ||||||||||||
:type data: pandas.DataFrame | ||||||||||||
|
||||||||||||
:param cols: | ||||||||||||
A list of column names to be transformed, or ``None`` to select all columns. | ||||||||||||
:type cols: list[str] | ||||||||||||
|
||||||||||||
:param code_type: | ||||||||||||
The recoding type. See description above for options. | ||||||||||||
:type code_type: str | ||||||||||||
|
||||||||||||
:param as_factor: | ||||||||||||
If True, re-encoded values will be returned as categorical variables | ||||||||||||
(pandas Categorical). | ||||||||||||
If False, re-encoded values will be returned as plain strings. | ||||||||||||
:type as_factor: bool | ||||||||||||
|
||||||||||||
:param dictionary: | ||||||||||||
A pandas DataFrame representing the dictionary used to translate | ||||||||||||
encodings. | ||||||||||||
:type dictionary: pandas.DataFrame | ||||||||||||
|
||||||||||||
:raises ValueError: | ||||||||||||
If the dictionary is missing required columns or if invalid input is | ||||||||||||
provided. | ||||||||||||
|
||||||||||||
:return: | ||||||||||||
The input DataFrame with re-encoded values for the specified columns. | ||||||||||||
:rtype: pandas.DataFrame | ||||||||||||
|
||||||||||||
.. note:: | ||||||||||||
Values which are in the data but are NOT in the dictionary will be | ||||||||||||
converted to NaN. | ||||||||||||
|
||||||||||||
:example: | ||||||||||||
|
||||||||||||
.. code-block:: python | ||||||||||||
|
||||||||||||
import ccao | ||||||||||||
|
||||||||||||
sample_data = ccao.sample_athena | ||||||||||||
|
||||||||||||
# Defaults to `long` code type | ||||||||||||
ccao.vars_recode(data=sample_data) | ||||||||||||
|
||||||||||||
# Recode to `short` code type | ||||||||||||
ccao.vars_recode(data=sample_data, code_type="short") | ||||||||||||
|
||||||||||||
# Recode only specified columns | ||||||||||||
ccao.vars_recode(data=sample_data, cols=["GAR1_SIZE"]) | ||||||||||||
""" | ||||||||||||
# Validate the dictionary schema | ||||||||||||
dictionary = dictionary if dictionary is not None else vars_dict | ||||||||||||
if dictionary.empty: | ||||||||||||
raise ValueError("dictionary must be a non-empty pandas DataFrame") | ||||||||||||
|
||||||||||||
required_columns = { | ||||||||||||
"var_code", | ||||||||||||
"var_value", | ||||||||||||
"var_value_short", | ||||||||||||
"var_type", | ||||||||||||
"var_data_type", | ||||||||||||
} | ||||||||||||
if not required_columns.issubset(dictionary.columns): | ||||||||||||
raise ValueError( | ||||||||||||
"Input dictionary must contain the following columns: " | ||||||||||||
f"{', '.join(sorted(required_columns))}" | ||||||||||||
) | ||||||||||||
|
||||||||||||
if not any(col.startswith("var_name_") for col in dictionary.columns): | ||||||||||||
raise ValueError( | ||||||||||||
"Input dictionary must contain at least one var_name_ column" | ||||||||||||
) | ||||||||||||
|
||||||||||||
if code_type not in ["short", "long", "code"]: | ||||||||||||
raise ValueError("code_type must be one of 'short', 'long', or 'code'") | ||||||||||||
|
||||||||||||
# Filter the dictionary for categoricals only and pivot it longer for | ||||||||||||
# easier lookup | ||||||||||||
dict_long = dictionary[ | ||||||||||||
(dictionary["var_type"] == "char") | ||||||||||||
& (dictionary["var_data_type"] == "categorical") | ||||||||||||
] | ||||||||||||
dict_long = dict_long.melt( | ||||||||||||
id_vars=["var_code", "var_value", "var_value_short"], | ||||||||||||
value_vars=[ | ||||||||||||
col for col in dictionary.columns if col.startswith("var_name_") | ||||||||||||
], | ||||||||||||
value_name="var_name", | ||||||||||||
var_name="var_type", | ||||||||||||
) | ||||||||||||
dict_long_pkey = ["var_code", "var_value", "var_value_short", "var_name"] | ||||||||||||
dict_long = dict_long[dict_long_pkey] | ||||||||||||
dict_long = dict_long.drop_duplicates(subset=dict_long_pkey) | ||||||||||||
|
||||||||||||
# Map the code type to its internal representation in the dictionary | ||||||||||||
values_to = { | ||||||||||||
"code": "var_code", | ||||||||||||
"long": "var_value", | ||||||||||||
"short": "var_value_short", | ||||||||||||
}[code_type] | ||||||||||||
|
||||||||||||
# Function to apply to each column to remap column values based on the | ||||||||||||
# vars dict | ||||||||||||
def transform_column( | ||||||||||||
col: pd.Series, var_name: str, values_to: str, as_factor: bool | ||||||||||||
) -> pd.Series | pd.Categorical: | ||||||||||||
if var_name in dict_long["var_name"].values: | ||||||||||||
var_rows = dict_long[dict_long["var_name"] == var_name] | ||||||||||||
# Get a dictionary mapping the possible codes to their values. | ||||||||||||
# Use `var_code` as the index (keys) for the dictionary, unless | ||||||||||||
# we're selecting `var_code`, in which case we can't set it as the | ||||||||||||
# index and use it for values | ||||||||||||
var_dict = ( | ||||||||||||
{code: code for code in var_rows["var_code"].tolist()} | ||||||||||||
if values_to == "var_code" | ||||||||||||
else var_rows.copy().set_index("var_code")[values_to].to_dict() | ||||||||||||
) | ||||||||||||
if as_factor: | ||||||||||||
return pd.Categorical( | ||||||||||||
col.map(var_dict), categories=list(var_dict.values()) | ||||||||||||
) | ||||||||||||
Comment on lines +276 to +278
This seemed like the closest analog to the |
||||||||||||
else: | ||||||||||||
return col.map(var_dict) | ||||||||||||
return col | ||||||||||||
|
||||||||||||
# Recode specified columns, or all columns if none were specified | ||||||||||||
cols = cols or data.columns | ||||||||||||
for var_name in cols: | ||||||||||||
if var_name in data.columns: | ||||||||||||
data[var_name] = transform_column( | ||||||||||||
data[var_name], var_name, values_to, as_factor | ||||||||||||
) | ||||||||||||
|
||||||||||||
return data |
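For readers following the review, here is a minimal sketch of the mapping step inside `transform_column`, assuming the lookup rows for one variable have already been filtered out of the dictionary. The `var_rows` frame and the input values are hypothetical stand-ins (only the Frame/Masonry labels come from the docstring above), so treat this as an illustration rather than the package's actual data.

```python
import pandas as pd

# Hypothetical lookup rows for a single variable, with codes stored as
# strings (matching the dtype=str change earlier in this PR).
var_rows = pd.DataFrame(
    {
        "var_code": ["1", "2"],
        "var_value": ["Frame", "Masonry"],
    }
)

# Build a code -> label mapping, as the function does for code_type="long"
var_dict = var_rows.set_index("var_code")["var_value"].to_dict()

# Hypothetical input column; "9" has no entry in the dictionary
col = pd.Series(["1", "2", "9"])

# Unmatched codes become NaN after .map(); the categories are fixed to the
# labels known to the dictionary
recoded = pd.Categorical(col.map(var_dict), categories=list(var_dict.values()))
```

This shows the behavior described in the docstring note: values present in the data but absent from the dictionary come back as missing, while the category list still reflects every label in the dictionary.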
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
================================================ | ||
Data dictionary for CCAO data sets and variables | ||
================================================ | ||
Comment on lines +1 to +3
I tried to use autodata to document this dict based on a docstring in the same way we do it in the R package. However, it seemed to be spitting out the docs for |
||
|
||
A crosswalk of CCAO variable names used in iasWorld, AWS, modeling, | ||
and open data. Also includes a translation of numeric character codes | ||
to their human-readable value (ROOF_CNST = 1 | ||
becomes ROOF_CNST = Shingle/Asphalt). | ||
|
||
Format | ||
------ | ||
|
||
A pandas DataFrame with the following columns: | ||
|
||
- **var_name_hie**: Column name of variable when stored in the legacy ADDCHARS SQL table. | ||
- **var_name_iasworld**: Column name for variable as stored in the system of record (iasWorld). | ||
- **var_name_athena**: Column name used for views and tables in AWS Athena. | ||
- **var_name_model**: Column name used while data is flowing through modeling pipelines. | ||
- **var_name_publish**: Human-readable column name used for public data sets. | ||
- **var_name_pretty**: Human-readable column name used for publication and reporting. | ||
- **var_type**: Variable type/prefix indicating the variable's function. For example, | ||
``ind_`` variables are always indicators (booleans), while ``char_`` variables are | ||
always property characteristics. | ||
- **var_data_type**: R data type variable values should be stored as. | ||
- **var_code**: Factor value for categorical variables. These are the values stored | ||
in the system of record. | ||
- **var_value**: Human-readable translation of factor value. | ||
- **var_value_short**: Human-readable translation of factor value, but as short as possible. |
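As a rough illustration of how the crosswalk columns above fit together, the sketch below builds a toy dictionary and pivots the `var_name_*` columns longer, similar to what `vars_recode` does internally. The toy rows, the `char_ext_wall` name, and the `name_source` column label are assumptions made for this example, not values taken from the real `vars_dict`.

```python
import pandas as pd

# Hypothetical two-row dictionary using a subset of the documented columns
toy_dict = pd.DataFrame(
    {
        "var_name_iasworld": ["EXT_WALL", "EXT_WALL"],
        "var_name_athena": ["char_ext_wall", "char_ext_wall"],
        "var_code": ["1", "2"],
        "var_value": ["Frame", "Masonry"],
    }
)

# Pivot the name columns longer so any naming scheme can be used for lookup
name_cols = [c for c in toy_dict.columns if c.startswith("var_name_")]
long = toy_dict.melt(
    id_vars=["var_code", "var_value"],
    value_vars=name_cols,
    var_name="name_source",
    value_name="var_name",
)

# Look up the label for code "2" under the (hypothetical) Athena name
label = long.loc[
    (long["var_name"] == "char_ext_wall") & (long["var_code"] == "2"),
    "var_value",
].iloc[0]
```

After the pivot, each code appears once per naming scheme, which is why a column can be recoded regardless of whether it uses the iasWorld, Athena, or model name.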
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
============================================================== | ||
Replace numerically coded variables with human-readable values | ||
============================================================== | ||
|
||
.. autofunction:: ccao.vars_recode |
Specifying a `str` dtype is necessary here, since otherwise code columns get interpreted as floats by default. We could restrict this type inference to only the `var_code` column, but it seems like everything in this dict should be a string anyway, so I figure we may as well make the type explicit for all columns.
issue (non-blocking): Won't this prevent matching to numeric values in the original data? i.e. `1` instead of `"1"`.
Theoretically yes, but I think that we ingest all codes as strings when pulling from ias, right? Note that this assumption is currently baked into the R package as well:
ccao/data-raw/vars_dict.R
Lines 1 to 4 in 5be78a6
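To make the trade-off in this thread concrete, here is a small sketch using an in-memory CSV (the rows and the `other_var` name are hypothetical). It shows pandas inferring floats for a code column that contains blanks, and numeric input failing to match string keys. The `astype(str)` cast at the end is only one possible workaround, not something this PR implements.

```python
import io

import pandas as pd

# Hypothetical CSV; the last row has no code, as a non-categorical
# variable might
csv_text = (
    "var_name_model,var_code,var_value\n"
    "char_ext_wall,1,Frame\n"
    "char_ext_wall,2,Masonry\n"
    "other_var,,\n"
)

# Default inference: the blank forces var_code to float64 (1.0, 2.0, NaN)
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str, codes stay as the strings "1" and "2"
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

mapping = (
    as_str.dropna(subset=["var_code"])
    .set_index("var_code")["var_value"]
    .to_dict()
)

# Integer codes in the data do not match string keys in the mapping
numeric = pd.Series([1, 2])
unmatched = numeric.map(mapping)

# Casting integers to strings first restores the match
matched = numeric.astype(str).map(mapping)
```

Note that the cast only helps for integer input: a float column would stringify as `"1.0"` and still miss, so ingesting codes as strings upstream (as the thread suggests) remains the simpler assumption.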