Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

13 add gold mcs epc data #17

Open
wants to merge 49 commits into
base: dev
Choose a base branch
from
Open

Conversation

crispy-wonton
Copy link
Contributor

@crispy-wonton crispy-wonton commented Nov 2, 2023

Fixes #16

Changes

  • Update to run analysis with both enhanced MCS and now also with gold MCS-EPC merged dataset for comparison of cumulative numbers and trends.
  • Add new function to preprocess gold data.
  • Update get_enhanced_mcs function to allow it to enhance MCS or the gold data
  • Add new figures for cumulative installations in total, and by rurality and gas status for gold data.

Instructions for Reviewer

  • I have merged changes from branch 10 to check the new data works with the total cumulative installations graph. I've labelled where code has already been reviewed in branch 10 and doesnt need to be reviewed again and have left comments to indicate new code to review.
  • Please could you check the pre-processing step of the gold dataset (load_mcs_epc_combined function) and ensure it is sensible. The pre-processed csv file is created by that same methodology.
  • Please could you sense check the get_enhanced_combined function (any suggestions for a better name welcome) to see if you agree that it's processing the data as expected - i.e. in the same way it was processing enhanced MCS by adding postcode, rurality, and gas columns.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

crispy-wonton and others added 30 commits September 21, 2023 17:54
improve and correct set up instructions and update output file extensions to html.
update local_data_dir to use relative file path from config base.yaml so config file can be updated by users rather than editing scripts.
change output graphs from produce_plots.py from .png to .html files to avoid altair CalledProcessError
- add argparser to allow user to specify local_data_dir, epc-, and mcs-batch
- add function to load MCS data directly from S3 bucket
- add function to check for EPC data locally and download from S3 if not located
- update existing functions to work with above
- add global parameters for EPC processing version to base.yaml config file
update instructions in README to explain new way of running script and to record batches for historical analyses
add script to generate stats and charts from asf_senedd_response into produce_plots.py
merge getters from loading.py in asf_senedd_response into get_data.py script
update EPC getter function name to match updates in get_data.py
add files:
- translation_config for welsh translations
- plotting.py for plotting functions
- augmenting.py for data processing and augmentation
update to prevent module not subscriptable error
allows user to specify which directory with supplementary data to use, meaning new supplementary data can be added to analysis as it's updated
- update README with correct file structure and new output
- move translation dicts to translation_config.py
- write summary stats into output .txt file instead of printing to console
- correct new dwelling labels
- update function names and documentation for clarity
- use logging instead of print
- log info for all charts when saved
- generalise calculation of max date for graphs
- remove unused config
@@ -485,3 +503,73 @@ def load_wales_hp(wales_epc):
wales_hp = wales_epc.loc[wales_epc.HP_INSTALLED].reset_index(drop=True)

return wales_hp


def load_mcs_epc_combined():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is the main part to review

return df


def get_total_cumsums(data, installation_date_col):
Copy link
Contributor Author

@crispy-wonton crispy-wonton Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is merged from branch 10 which has already been reviewed - no need to review here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed the docstrings are missing the args info

Args:
   data:...
   installation_date_col:...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review this file - This is merged from branch 10 which has already been reviewed

README.md Outdated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review this file - This is merged from branch 10 which has already been reviewed

Comment on lines 37 to 51
# Total MCS installations
enhanced_mcs = process_data.get_enhanced_combined(mcs_or_gold="mcs")
total_cumulative_installations = process_data.get_total_cumsums(
data=enhanced_mcs, installation_date_col="commission_date"
)

total_cumulative_installations_chart = time_series_comparison(
data=total_cumulative_installations,
title="Cumulative MCS certified heat pump installations over time",
y_var="cumsum:Q",
y_title="Number of heat pump installations",
color_var="colour:N",
filename="total_cumulative_installations",
output_dir=output_folder,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review this part - This is merged from branch 10 which has already been reviewed

@@ -74,6 +97,70 @@
output_dir=output_folder,
)

# ======================================================
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here are the new graphs that use MCS -EPC gold

@crispy-wonton crispy-wonton marked this pull request as ready for review November 2, 2023 18:28
Copy link
Contributor

@sofiapinto sofiapinto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @crispy-wonton Roisin,

I've looked at the code (and specifically at the functions you mentioned in your PR summary) and haven't found anything problematic with the logic - looks good! I didn't have time to run the code though, so might be worth asking someone else to run it tomorrow just to check the code runs smoothly for someone else. If no one is available, I can run it on Monday.

I think you're missing dask from the requirements.txt but because I didn't run the code I don't know if it runs without it.

return df


def get_total_cumsums(data, installation_date_col):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed the docstrings are missing the args info

Args:
   data:...
   installation_date_col:...



def cumsums_by_variable(variable, new_var_name, data=enhanced_mcs):
def cumsums_by_variable(
variable, new_var_name, data, installation_date_col="HP_INSTALL_DATE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why you chose to have the HP_INSTALL_DATE as default instead of commision_date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, can be either.

Comment on lines +568 to +569
if batch == "231009":
df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead you could just check max date in EPC, max date in MCS and get the min between the two max's :) it would avoid the hardcoding. Also because this might happen in future batches.

Suggested change
if batch == "231009":
df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
max_epc_date = df["HP_INSTALL_DATE"].max()
max_mcs_date = df["commision_date"].max()
df = df[df["HP_INSTALL_DATE"] <= min(max_epc_date, max_mcs_date)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea but it doesn't look like 'commission_date' col exists in the dataset. I believe the processing combines all installation/commission dates into the HP_INSTALL_DATE col

Comment on lines +537 to +539
"HP_INSTALL_DATE": "object",
"UPRN": "object",
"installation_type": "object",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you setting these as object?

total_hp = f"Number of heat pumps: {len(wales_hp)}\n"
total_epc = f"Number of properties in EPC: {len(wales_df)}\n"
hp_perc = "Estimated percentage of properties with a heat pump: \
total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this epc or epc + mcs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EPC only

hp_perc = "Estimated percentage of properties with a heat pump: \
total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"
total_epc_properties = f"Number of properties in EPC: {len(wales_df)}\n"
hp_perc = "Estimated percentage of EPC properties with a heat pump: \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again just epc or epc + mcs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just EPC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add gold merged MCS-EPC data into analysis
2 participants