13 add gold mcs epc data #17

crispy-wonton · 2023-11-02T18:14:16Z

Fixes #16

Changes

Update the analysis to run with both the enhanced MCS data and the gold MCS-EPC merged dataset, so cumulative numbers and trends can be compared.
Add a new function to preprocess the gold data.
Update the get_enhanced_mcs function so it can enhance either the MCS or the gold data.
Add new figures for cumulative installations in total, and by rurality and gas status, for the gold data.

Instructions for Reviewer

I have merged changes from branch 10 to check that the new data works with the total cumulative installations graph. I've labelled code that was already reviewed in branch 10 and doesn't need to be reviewed again, and have left comments to indicate new code to review.
Please could you check the pre-processing step of the gold dataset (load_mcs_epc_combined function) and ensure it is sensible. The pre-processed csv file is created by that same methodology.
Please could you sense check the get_enhanced_combined function (any suggestions for a better name welcome) to see if you agree that it's processing the data as expected - i.e. in the same way it was processing enhanced MCS by adding postcode, rurality, and gas columns.

Checklist:

improve and correct set up instructions and update output file extensions to html.

update local_data_dir to use relative file path from config base.yaml so config file can be updated by users rather than editing scripts.

change output graphs from produce_plots.py from .png to .html files to avoid altair CalledProcessError

- add argparser to allow user to specify local_data_dir, epc-, and mcs-batch
- add function to load MCS data directly from S3 bucket
- add function to check for EPC data locally and download from S3 if not located
- update existing functions to work with above
- add global parameters for EPC processing version to base.yaml config file

update instructions in README to explain new way of running script and to record batches for historical analyses

add script to generate stats and charts from asf_senedd_response into produce_plots.py

merge getters from loading.py in asf_senedd_response into get_data.py script

update EPC getter function name to match updates in get_data.py

add files:
- translation_config for welsh translations
- plotting.py for plotting functions
- augmenting.py for data processing and augmentation

…repo

update to prevent module not subscriptable error

improve selection of min and max dates

allows user to specify which directory with supplementary data to use, meaning new supplementary data can be added to analysis as it's updated

- update README with correct file structure and new output
- move translation dicts to translation_config.py
- write summary stats into output .txt file instead of printing to console
- correct new dwelling labels
- update function names and documentation for clarity

- use logging instead of print
- log info for all charts when saved
- generalise calculation of max date for graphs
- remove unused config

…to one EPC record

…param

…data also change graphs to nesta blue

…_installations

- add args to specify merged dataset batch and whether to download from s3
- add function to get gold data from local file or download from s3 and conduct preprocessing

- add function to merge gold data with postcode, rurality, and gas data
- adapt cumsums by variable function to work with gold data

…re relevant

…s' into test_combined_data

crispy-wonton · 2023-11-02T18:22:27Z

asf_welsh_energy_consultation/getters/get_data.py

@@ -485,3 +503,73 @@ def load_wales_hp(wales_epc):
    wales_hp = wales_epc.loc[wales_epc.HP_INSTALLED].reset_index(drop=True)

    return wales_hp
+
+
+def load_mcs_epc_combined():


This function is the main part to review

crispy-wonton · 2023-11-02T18:23:08Z

asf_welsh_energy_consultation/pipeline/process_data.py

+    return df
+
+
+def get_total_cumsums(data, installation_date_col):


This function is merged from branch 10 which has already been reviewed - no need to review here


crispy-wonton · 2023-11-02T18:24:00Z

asf_welsh_energy_consultation/pipeline/plotting.py

No need to review this file - This is merged from branch 10 which has already been reviewed

crispy-wonton · 2023-11-02T18:25:01Z

README.md

No need to review this file - This is merged from branch 10 which has already been reviewed

crispy-wonton · 2023-11-02T18:25:29Z

asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py

+    # Total MCS installations
+    enhanced_mcs = process_data.get_enhanced_combined(mcs_or_gold="mcs")
+    total_cumulative_installations = process_data.get_total_cumsums(
+        data=enhanced_mcs, installation_date_col="commission_date"
+    )
+
+    total_cumulative_installations_chart = time_series_comparison(
+        data=total_cumulative_installations,
+        title="Cumulative MCS certified heat pump installations over time",
+        y_var="cumsum:Q",
+        y_title="Number of heat pump installations",
+        color_var="colour:N",
+        filename="total_cumulative_installations",
+        output_dir=output_folder,
+    )


No need to review this part - This is merged from branch 10 which has already been reviewed

crispy-wonton · 2023-11-02T18:25:55Z

asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py

@@ -74,6 +97,70 @@
        output_dir=output_folder,
    )

+    # ======================================================


From here on are the new graphs that use the MCS-EPC gold data

sofiapinto

Hi @crispy-wonton Roisin,

I've looked at the code (and specifically at the functions you mentioned in your PR summary) and haven't found anything problematic with the logic - looks good! I didn't have time to run the code though, so might be worth asking someone else to run it tomorrow just to check the code runs smoothly for someone else. If no one is available, I can run it on Monday.

I think you're missing dask from the requirements.txt but because I didn't run the code I don't know if it runs without it.

sofiapinto · 2023-11-02T21:30:37Z

asf_welsh_energy_consultation/pipeline/process_data.py

+    return df
+
+
+def get_total_cumsums(data, installation_date_col):


just noticed the docstrings are missing the args info

Args: data:... installation_date_col:...
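A minimal sketch of what the completed docstring could look like. Only the function name and its two parameters come from the diff above. The function body, the return shape, and the "date" column name are assumptions made so the example runs; the "cumsum" column name is suggested by the y_var="cumsum:Q" argument passed in produce_plots_and_stats.py.

```python
import pandas as pd


def get_total_cumsums(data, installation_date_col):
    """Calculate the cumulative number of installations over time.

    Args:
        data (pd.DataFrame): Installation records, one row per installation.
        installation_date_col (str): Name of the column in `data` that holds
            the installation date of each record.

    Returns:
        pd.DataFrame: One row per installation date, with the running total
            of installations in a "cumsum" column.
    """
    # Assumed implementation, not the one in the PR.
    dates = pd.to_datetime(data[installation_date_col])
    # Count installations per date, order chronologically, then accumulate.
    counts = dates.value_counts().sort_index()
    return counts.cumsum().rename("cumsum").rename_axis("date").reset_index()
```

Called as in the plotting script, for example get_total_cumsums(data=enhanced_mcs, installation_date_col="commission_date").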

sofiapinto · 2023-11-02T21:31:56Z

asf_welsh_energy_consultation/pipeline/process_data.py



-def cumsums_by_variable(variable, new_var_name, data=enhanced_mcs):
+def cumsums_by_variable(
+    variable, new_var_name, data, installation_date_col="HP_INSTALL_DATE"


Any reason why you chose HP_INSTALL_DATE as the default instead of commission_date?

No reason, can be either.
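Since the default can be either column, callers can simply pass the date column explicitly. The sketch below illustrates that with the signature from the diff; the body is an assumption (a per-group cumulative count), not the PR's implementation, and the "date", "count" and "cumsum" column names are hypothetical.

```python
import pandas as pd


def cumsums_by_variable(
    variable, new_var_name, data, installation_date_col="HP_INSTALL_DATE"
):
    """Assumed sketch: cumulative installation counts per value of `variable`."""
    dates = pd.to_datetime(data[installation_date_col])
    # Count installations per (group, date) pair.
    counts = (
        data.assign(date=dates)
        .groupby([variable, "date"])
        .size()
        .rename("count")
        .reset_index()
        .sort_values([variable, "date"])
    )
    # Running total within each group, in date order.
    counts["cumsum"] = counts.groupby(variable)["count"].cumsum()
    return counts.rename(columns={variable: new_var_name})


# Gold data can rely on the default; MCS data names its own date column.
mcs = pd.DataFrame(
    {
        "rurality": ["rural", "urban", "rural", "rural"],
        "commission_date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
    }
)
mcs_cumsums = cumsums_by_variable(
    "rurality", "Rurality", mcs, installation_date_col="commission_date"
)
```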

sofiapinto · 2023-11-02T21:34:01Z

asf_welsh_energy_consultation/getters/get_data.py

+        if batch == "231009":
+            df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]


Instead you could just check the max date in EPC and the max date in MCS, and take the minimum of the two maxes :) It would avoid the hardcoding, and this might also happen in future batches.

Suggested change

-        if batch == "231009":
-            df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
+        max_epc_date = df["HP_INSTALL_DATE"].max()
+        max_mcs_date = df["commission_date"].max()
+        df = df[df["HP_INSTALL_DATE"] <= min(max_epc_date, max_mcs_date)]

Great idea, but it doesn't look like a 'commission_date' column exists in the dataset. I believe the processing combines all installation/commission dates into the HP_INSTALL_DATE column.
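Because the merged dataset only carries HP_INSTALL_DATE, one possible way to keep the reviewer's idea would be to derive the cutoff from the separate EPC and MCS sources before they are merged, then apply it to the gold data. This is a sketch under that assumption: get_common_cutoff_date and filter_to_common_period are hypothetical helpers, not functions in the repository.

```python
import pandas as pd


def get_common_cutoff_date(epc_dates, mcs_dates):
    """Hypothetical helper: latest date that both sources cover."""
    max_epc_date = pd.to_datetime(epc_dates).max()
    max_mcs_date = pd.to_datetime(mcs_dates).max()
    return min(max_epc_date, max_mcs_date)


def filter_to_common_period(df, cutoff, date_col="HP_INSTALL_DATE"):
    """Hypothetical helper: keep rows dated on or before the cutoff."""
    dates = pd.to_datetime(df[date_col])
    return df[dates <= cutoff].reset_index(drop=True)


# Toy data standing in for the real sources.
epc_dates = pd.Series(["2023-05-01", "2023-06-30"])
mcs_dates = pd.Series(["2023-09-30"])
gold = pd.DataFrame(
    {"HP_INSTALL_DATE": ["2023-01-01", "2023-06-30", "2023-07-15"]}
)

cutoff = get_common_cutoff_date(epc_dates, mcs_dates)
gold_filtered = filter_to_common_period(gold, cutoff)
```

Unlike the hard-coded batch check, this would adapt to future batches, at the cost of needing both source datasets available when the gold data is preprocessed.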


sofiapinto · 2023-11-02T21:39:34Z

asf_welsh_energy_consultation/getters/get_data.py

+                "HP_INSTALL_DATE": "object",
+                "UPRN": "object",
+                "installation_type": "object",


why are you setting these as object?


sofiapinto · 2023-11-02T21:47:21Z

asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py

-    total_hp = f"Number of heat pumps: {len(wales_hp)}\n"
-    total_epc = f"Number of properties in EPC: {len(wales_df)}\n"
-    hp_perc = "Estimated percentage of properties with a heat pump: \
+    total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"


is this epc or epc + mcs?

sofiapinto · 2023-11-02T21:47:32Z

asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py

-    hp_perc = "Estimated percentage of properties with a heat pump: \
+    total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"
+    total_epc_properties = f"Number of properties in EPC: {len(wales_df)}\n"
+    hp_perc = "Estimated percentage of EPC properties with a heat pump: \


again just epc or epc + mcs?


crispy-wonton and others added 30 commits September 21, 2023 17:54

update README

cfe2149

improve and correct set up instructions and update output file extensions to html.

update local_data_dir in getters

aec3c3e

update local_data_dir to use relative file path from config base.yaml so config file can be updated by users rather than editing scripts.

update requirements.txt with versions

a7be30d

change outputs to .html files

60b59bc

change output graphs from produce_plots.py from .png to .html files to avoid altair CalledProcessError

fix introduced typo in requirements.txt

a56f765

update README

a37451a

update instructions in README to explain new way of running script and to record batches for historical analyses

update requirements.txt with argpase

43fd319

merge asf_senedd_response wales_analysis.py

cd6d1eb

add script to generate stats and charts from asf_senedd_response into produce_plots.py

add global parameters to config file

ba2dc7a

merge asf_senedd_response getters from loading.py

ecebb72

merge getters from loading.py in asf_senedd_response into get_data.py script

update function name

3c5d597

update EPC getter function name to match updates in get_data.py

add new files from asf_senedd_response repo

8768231

add files:
- translation_config for welsh translations
- plotting.py for plotting functions
- augmenting.py for data processing and augmentation

resolve merge conflits - merge branch 'dev' into 03_merge_asf_senedd_…

465e997

…repo

update config variable name in __init__.py

31cf6d1

update to prevent module not subscriptable error

fix minor errors in produce_plots

bfe9ea3

use new config_file variable name

3a765f0

update domain min and max variables in time_series_comparison

fd5d6e6

improve selection of min and max dates

update README with new output files

7518c22

update plot max dates for newest EPC/MCS batches

1f29a92

add supp_data arg

97899cd

allows user to specify which directory with supplementary data to use, meaning new supplementary data can be added to analysis as it's updated

delete formatting file and merge augmenting with process_data.py

b2d9361

add logging and improve how default domain min and max is determined

f79bd33

update import statements to remove import *

84694fa

update documentation

71b1ebb

make improvements suggested in code review

d935d63

- use logging instead of print
- log info for all charts when saved
- generalise calculation of max date for graphs
- remove unused config

remove duplicate UPRNs - indicates multiple MCS installations joined …

5c30dfd

…to one EPC record

update get_mcs_and_joined_data function to allow passing epc_version …

7264776

…param

update postcode df to reduce row loss when adding country col to MCS …

1636960

…data also change graphs to nesta blue

crispy-wonton added 13 commits October 25, 2023 13:34

add new total cumulative MCS installations plot

2792329

Fix merge conflicts and merge branch 'dev' into 10_add_cumulative_mcs…

b80d5da

…_installations

remove unnecessary data paths not addressed in merge

4bff00c

add note for clarity

ae42bc6

add new graph file to outputs section in README

5273ab7

update cumulative installations graph name to be more descriptive

0855652

get new gold merged dataset

cee0675

- add args to specify merged dataset batch and whether to download from s3
- add function to get gold data from local file or download from s3 and conduct preprocessing

get enhanced gold data

be738e3

- add function to merge gold data with postcode, rurality, and gas data
- adapt cumsums by variable function to work with gold data

update produce plots to use new gold data instead of enhanced mcs whe…

95493ad

…re relevant

Fix merge conflict - merge branch '10_add_cumulative_mcs_installation…

37bd4d3

…s' into test_combined_data

add gold MCS EPC figures

53ecce2

change enhanced_mcs function to work for MCS and gold data

9b1b6e4

fix file path

4b7ce75

crispy-wonton requested a review from sofiapinto November 2, 2023 18:22


crispy-wonton marked this pull request as ready for review November 2, 2023 18:28

sofiapinto reviewed Nov 2, 2023


crispy-wonton added 6 commits November 3, 2023 12:25

add new gold args and output files to readme

0cfe535

update requirements to add dask and others

6e8ed9b

fix merge conflicts - merge branch 'dev' into 13_add_gold_mcs_epc_data

430d848

add gold batch to october 23 analysis in readme

6437ad1

remove duplicate code and update get_enhanced_mcs to get_combined_mcs

a2db691

add missing args to function and update get_enhanced_mcs to get_enhan…

4352af1

…ced_combined
