13 add gold mcs epc data #17
base: dev
improve and correct set up instructions and update output file extensions to html.
update local_data_dir to use relative file path from config base.yaml so config file can be updated by users rather than editing scripts.
change output graphs from produce_plots.py from .png to .html files to avoid altair CalledProcessError
- add argparser to allow user to specify local_data_dir, epc-, and mcs-batch
- add function to load MCS data directly from S3 bucket
- add function to check for EPC data locally and download from S3 if not located
- update existing functions to work with above
- add global parameters for EPC processing version to base.yaml config file
update instructions in README to explain new way of running script and to record batches for historical analyses
add script to generate stats and charts from asf_senedd_response into produce_plots.py
merge getters from loading.py in asf_senedd_response into get_data.py script
update EPC getter function name to match updates in get_data.py
add files:
- translation_config for Welsh translations
- plotting.py for plotting functions
- augmenting.py for data processing and augmentation
update to prevent module not subscriptable error
improve selection of min and max dates
allows user to specify which directory with supplementary data to use, meaning new supplementary data can be added to analysis as it's updated
- update README with correct file structure and new output
- move translation dicts to translation_config.py
- write summary stats into output .txt file instead of printing to console
- correct new dwelling labels
- update function names and documentation for clarity

- use logging instead of print
- log info for all charts when saved
- generalise calculation of max date for graphs
- remove unused config
…to one EPC record
…data also change graphs to nesta blue
- add args to specify merged dataset batch and whether to download from s3
- add function to get gold data from local file or download from s3 and conduct preprocessing

- add function to merge gold data with postcode, rurality, and gas data
- adapt cumsums by variable function to work with gold data
…s' into test_combined_data
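The commits above describe command-line arguments for the data directory and batches, and a getter that checks for data locally before downloading it from S3. A minimal sketch of that pattern follows. The flag names, defaults, and the `download` callable (a stand-in for whatever fetches the file from S3) are assumptions for illustration, not taken from the repository.

```python
import argparse
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def create_argparser():
    """Build the CLI parser; flag names and defaults are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Produce plots for a given data batch.")
    parser.add_argument("--local_data_dir", type=Path, default=Path("inputs/data"))
    parser.add_argument("--epc_batch", default="newest")
    parser.add_argument("--mcs_batch", default="newest")
    return parser


def get_local_or_download(local_path, download):
    """Return local_path, calling download(local_path) first only if the file is missing.

    `download` is a hypothetical callable standing in for the S3 fetch.
    """
    local_path = Path(local_path)
    if local_path.exists():
        logger.info("Found %s locally, skipping download", local_path)
    else:
        local_path.parent.mkdir(parents=True, exist_ok=True)
        logger.info("Downloading %s", local_path)
        download(local_path)
    return local_path
```

For example, `create_argparser().parse_args(["--epc_batch", "231009"])` would select the batch referenced later in this review while leaving the other arguments at their defaults. Passing the download step in as a callable keeps the caching logic testable without S3 access.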
@@ -485,3 +503,73 @@ def load_wales_hp(wales_epc):
    wales_hp = wales_epc.loc[wales_epc.HP_INSTALLED].reset_index(drop=True)

    return wales_hp


def load_mcs_epc_combined():
This function is the main part to review
    return df


def get_total_cumsums(data, installation_date_col):
This function is merged from branch 10 which has already been reviewed - no need to review here
just noticed the docstrings are missing the args info
Args:
data:...
installation_date_col:...
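A sketch of what the completed docstring could look like, using Google-style `Args` and `Returns` sections. The argument descriptions and the function body are assumptions added only so the docstring has something concrete to describe; the real implementation may differ.

```python
import pandas as pd


def get_total_cumsums(data, installation_date_col):
    """Calculate the cumulative number of installations over time.

    Args:
        data (pd.DataFrame): Installation records, one row per installation.
        installation_date_col (str): Name of the column in `data` that holds
            the installation date.

    Returns:
        pd.DataFrame: One row per date, with the date column, `number`
            (installations on that date) and `cumsum` (running total).
    """
    # Hypothetical body: count installations per date, then accumulate.
    counts = (
        data.groupby(installation_date_col)
        .size()
        .rename("number")
        .sort_index()
        .reset_index()
    )
    counts["cumsum"] = counts["number"].cumsum()
    return counts
```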
No need to review this file - This is merged from branch 10 which has already been reviewed
README.md
No need to review this file - This is merged from branch 10 which has already been reviewed
# Total MCS installations
enhanced_mcs = process_data.get_enhanced_combined(mcs_or_gold="mcs")
total_cumulative_installations = process_data.get_total_cumsums(
    data=enhanced_mcs, installation_date_col="commission_date"
)

total_cumulative_installations_chart = time_series_comparison(
    data=total_cumulative_installations,
    title="Cumulative MCS certified heat pump installations over time",
    y_var="cumsum:Q",
    y_title="Number of heat pump installations",
    color_var="colour:N",
    filename="total_cumulative_installations",
    output_dir=output_folder,
)
No need to review this part - This is merged from branch 10 which has already been reviewed
@@ -74,6 +97,70 @@
    output_dir=output_folder,
)

# ======================================================
The new graphs that use the MCS-EPC gold data start from here.
Hi @crispy-wonton Roisin,
I've looked at the code (and specifically at the functions you mentioned in your PR summary) and haven't found anything problematic with the logic - looks good! I didn't have time to run the code though, so might be worth asking someone else to run it tomorrow just to check the code runs smoothly for someone else. If no one is available, I can run it on Monday.
I think you're missing dask from requirements.txt, but because I didn't run the code I don't know whether it runs without it.
-def cumsums_by_variable(variable, new_var_name, data=enhanced_mcs):
+def cumsums_by_variable(
+    variable, new_var_name, data, installation_date_col="HP_INSTALL_DATE"
Any reason why you chose HP_INSTALL_DATE as the default instead of commission_date?
No reason, can be either.
if batch == "231009":
    df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
Instead, you could take the max date in EPC and the max date in MCS, then use the earlier of the two :) That would avoid the hardcoding, and it would also cover future batches where the same thing might happen.
-if batch == "231009":
-    df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
+max_epc_date = df["HP_INSTALL_DATE"].max()
+max_mcs_date = df["commission_date"].max()
+df = df[df["HP_INSTALL_DATE"] <= min(max_epc_date, max_mcs_date)]
Great idea, but it doesn't look like the 'commission_date' column exists in the dataset. I believe the processing combines all installation/commission dates into the HP_INSTALL_DATE column.
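Since the merged dataset carries only `HP_INSTALL_DATE`, one way to still avoid the hardcoded date would be to compute each source's latest date before merging and pass the earlier of the two in as the cutoff. A sketch under that assumption follows; the helper name, the `INSPECTION_DATE` column, and the toy data are all hypothetical.

```python
import pandas as pd


def filter_to_common_coverage(combined, epc_max_date, mcs_max_date, date_col="HP_INSTALL_DATE"):
    """Keep rows dated on or before the earlier of the two sources' latest dates."""
    cutoff = min(pd.Timestamp(epc_max_date), pd.Timestamp(mcs_max_date))
    dates = pd.to_datetime(combined[date_col])
    return combined[dates <= cutoff].reset_index(drop=True)


# Toy frames standing in for the real sources (values are made up).
epc = pd.DataFrame({"INSPECTION_DATE": ["2023-05-10", "2023-06-30"]})
mcs = pd.DataFrame({"commission_date": ["2023-06-01", "2023-09-15"]})
combined = pd.DataFrame({"HP_INSTALL_DATE": ["2023-05-10", "2023-06-30", "2023-09-15"]})

filtered = filter_to_common_coverage(
    combined,
    epc_max_date=epc["INSPECTION_DATE"].max(),
    mcs_max_date=mcs["commission_date"].max(),
)
```

Here the EPC data ends earlier than the MCS data, so the later MCS-only row is dropped and the cutoff follows whichever source is less up to date in a given batch.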
"HP_INSTALL_DATE": "object",
"UPRN": "object",
"installation_type": "object",
why are you setting these as object?
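The thread does not record an answer. For context, here is a small sketch of what reading these columns as `object` does in pandas, and how the date column can be converted explicitly afterwards. The CSV content is made up for illustration, and whether these effects motivated the choice in the PR is not known.

```python
import io

import pandas as pd

# Made-up CSV: a UPRN with a leading zero, plus one row with missing values.
csv_text = "UPRN,HP_INSTALL_DATE,installation_type\n0123,2023-06-30,Domestic\n,,\n"

dtypes = {"HP_INSTALL_DATE": "object", "UPRN": "object", "installation_type": "object"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Read as object, the UPRN stays the string "0123"; with type inference the
# missing value would make the column float and turn it into 123.0.
# The date column is also a plain string until it is converted explicitly.
df["HP_INSTALL_DATE"] = pd.to_datetime(df["HP_INSTALL_DATE"])
```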
-total_hp = f"Number of heat pumps: {len(wales_hp)}\n"
-total_epc = f"Number of properties in EPC: {len(wales_df)}\n"
-hp_perc = "Estimated percentage of properties with a heat pump: \
+total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"
is this epc or epc + mcs?
EPC only
-hp_perc = "Estimated percentage of properties with a heat pump: \
+total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"
+total_epc_properties = f"Number of properties in EPC: {len(wales_df)}\n"
+hp_perc = "Estimated percentage of EPC properties with a heat pump: \
again just epc or epc + mcs?
Just EPC
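Both figures come from EPC alone. A self-contained sketch of how such summary lines could be assembled and written to a text file, as one of the commits describes, is below. The function name, output path handling, and two-decimal rounding are assumptions, not the code in the PR.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def write_epc_summary(n_epc_hp, n_epc_properties, output_path):
    """Write EPC-only heat pump summary statistics to a text file and return the lines."""
    if n_epc_properties <= 0:
        raise ValueError("n_epc_properties must be positive")
    share = 100 * n_epc_hp / n_epc_properties
    lines = [
        f"Number of heat pumps in EPC: {n_epc_hp}",
        f"Number of properties in EPC: {n_epc_properties}",
        f"Estimated percentage of EPC properties with a heat pump: {share:.2f}%",
    ]
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text("\n".join(lines) + "\n")
    logger.info("Saved summary stats to %s", output_path)
    return lines
```

In the PR's terms this might be called as `write_epc_summary(len(wales_hp), len(wales_df), ...)`, with the output path left to the caller.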
Fixes #16
Changes
Instructions for Reviewer
- … `load_mcs_epc_combined` function) and ensure it is sensible. The pre-processed csv file is created by that same methodology.
- … `get_enhanced_combined` function (any suggestions for a better name welcome) to see if you agree that it's processing the data as expected, i.e. in the same way it was processing enhanced MCS by adding postcode, rurality, and gas columns.

Checklist:
- … `notebooks/`
- … `pre-commit` and addressed any issues not automatically fixed
- … `dev`
- … `README`s