13 add gold mcs epc data #17

Open
wants to merge 49 commits into base: dev

Commits
cfe2149
update README
crispy-wonton Sep 21, 2023
aec3c3e
update local_data_dir in getters
crispy-wonton Sep 21, 2023
a7be30d
update requirements.txt with versions
crispy-wonton Sep 21, 2023
60b59bc
change outputs to .html files
crispy-wonton Sep 21, 2023
a56f765
fix introduced typo in requirements.txt
crispy-wonton Sep 22, 2023
cf57335
automate all data downloads/loading in script
crispy-wonton Oct 6, 2023
a37451a
update README
crispy-wonton Oct 6, 2023
43fd319
update requirements.txt with argpase
crispy-wonton Oct 6, 2023
cd6d1eb
merge asf_senedd_response wales_analysis.py
crispy-wonton Oct 10, 2023
ba2dc7a
add global parameters to config file
crispy-wonton Oct 10, 2023
ecebb72
merge asf_senedd_response getters from loading.py
crispy-wonton Oct 10, 2023
3c5d597
update function name
crispy-wonton Oct 10, 2023
8768231
add new files from asf_senedd_response repo
crispy-wonton Oct 10, 2023
465e997
resolve merge conflits - merge branch 'dev' into 03_merge_asf_senedd_…
crispy-wonton Oct 13, 2023
31cf6d1
update config variable name in __init__.py
crispy-wonton Oct 13, 2023
bfe9ea3
fix minor errors in produce_plots
crispy-wonton Oct 13, 2023
3a765f0
use new config_file variable name
crispy-wonton Oct 13, 2023
fd5d6e6
update domain min and max variables in time_series_comparison
crispy-wonton Oct 13, 2023
7518c22
update README with new output files
crispy-wonton Oct 13, 2023
1f29a92
update plot max dates for newest EPC/MCS batches
crispy-wonton Oct 13, 2023
97899cd
add supp_data arg
crispy-wonton Oct 13, 2023
b2d9361
delete formatting file and merge augmenting with process_data.py
crispy-wonton Oct 13, 2023
f79bd33
add logging and improve how default domain min and max is determined
crispy-wonton Oct 16, 2023
84694fa
update import statements to remove import *
crispy-wonton Oct 16, 2023
71af7e1
make changes requested in review
crispy-wonton Oct 20, 2023
71b1ebb
update documentation
crispy-wonton Oct 20, 2023
d935d63
make improvements suggested in code review
crispy-wonton Oct 23, 2023
5c30dfd
remove duplicate UPRNs - indicates multiple MCS installations joined …
crispy-wonton Oct 24, 2023
7264776
update get_mcs_and_joined_data function to allow passing epc_version …
crispy-wonton Oct 24, 2023
1636960
update postcode df to reduce row loss when adding country col to MCS …
crispy-wonton Oct 24, 2023
2792329
add new total cumulative MCS installations plot
crispy-wonton Oct 25, 2023
b80d5da
Fix merge conflicts and merge branch 'dev' into 10_add_cumulative_mcs…
crispy-wonton Oct 30, 2023
4bff00c
remove unnecessary data paths not addressed in merge
crispy-wonton Oct 30, 2023
ae42bc6
add note for clarity
crispy-wonton Oct 31, 2023
5273ab7
add new graph file to outputs section in README
crispy-wonton Nov 1, 2023
0855652
update cumulative installations graph name to be more descriptive
crispy-wonton Nov 1, 2023
cee0675
get new gold merged dataset
crispy-wonton Nov 2, 2023
be738e3
get enhanced gold data
crispy-wonton Nov 2, 2023
95493ad
update produce plots to use new gold data instead of enhanced mcs whe…
crispy-wonton Nov 2, 2023
37bd4d3
Fix merge conflict - merge branch '10_add_cumulative_mcs_installation…
crispy-wonton Nov 2, 2023
53ecce2
add gold MCS EPC figures
crispy-wonton Nov 2, 2023
9b1b6e4
change enhanced_mcs function to work for MCS and gold data
crispy-wonton Nov 2, 2023
4b7ce75
fix file path
crispy-wonton Nov 2, 2023
0cfe535
add new gold args and output files to readme
crispy-wonton Nov 3, 2023
6e8ed9b
update requirements to add dask and others
crispy-wonton Nov 3, 2023
430d848
fix merge conflicts - merge branch 'dev' into 13_add_gold_mcs_epc_data
crispy-wonton Nov 3, 2023
6437ad1
add gold batch to october 23 analysis in readme
crispy-wonton Nov 3, 2023
a2db691
remove duplicate code and update get_enhanced_mcs to get_combined_mcs
crispy-wonton Nov 3, 2023
4352af1
add missing args to function and update get_enhanced_mcs to get_enhan…
crispy-wonton Nov 3, 2023
16 changes: 12 additions & 4 deletions README.md
@@ -19,24 +19,31 @@ The remainder of the charts in the response can be produced from code in the rep
- Activate conda environment: `conda activate asf_welsh_energy_consultation`
- Run `make inputs-pull` to pull the zipped supplementary data from S3 and put it in `/inputs/data`. There will be one folder per historical analysis
containing the supplementary data files as listed in the `Historical analysis` section below.

## Run the script

- Run `python asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py --local_data_dir <YOUR_LOCAL_DIR>`. You need to specify the path to the local
directory where your local copy of the EPC data is/will be saved by replacing `<YOUR_LOCAL_DIR>` with the path to your "ASF_data" directory or equivalent.
If you don't have a local directory for ASF core data, you can create a folder called "ASF_data" in your home directory.
- You can specify which batch of EPC data to download and which batch of MCS data to load from S3 by passing the `--epc_batch` and `--mcs_batch` arguments.
Both default to the newest data on S3.
- You can specify which supplementary data folder to use by passing the `--supp_data` argument. It defaults to using the latest supplementary data folder.
- You can specify which batch of gold MCS-EPC merged data to use with the `--gold_mcs_epc_batch` argument. Pass the batch as YYMMDD.
- To download and process a new gold MCS-EPC batch (i.e. a different batch from the preprocessed `hp_installed_gold_[YYMMDD].csv` file in the supplementary data folder
in `inputs/data`), set the `--download_gold_data_from_s3` argument to `True`. Note that this download can take ~30 minutes.
- Run `python asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py -h` for more info.
- To recreate the full October 2023 analysis, set the `--calculate_average_installations` argument to `True`. This will calculate some additional numbers on MCS installations per year included in the October 2023 response. For other historical analyses, this argument is not required and defaults to `False`.
- Run `python asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py -h` for more info.

The script should generate the following seven plots which will be saved in your local repo in `outputs/figures`:
## The script should generate the following ten plots which will be saved in your local repo in `outputs/figures`:

- `cumulative_retrofits.html`
- `electric_tenure.html`
- `installations_by_gas_status.html`
- `installations_by_rurality.html`
- `[gold_]installations_by_gas_status.html`
- `[gold_]installations_by_rurality.html`
- `new_build_hp_cumulative.html`
- `new_build_hp_proportion.html`
- `total_cumulative_installations.html`
- `[gold_]total_cumulative_installations.html`

It should generate a further 10 plots, five in English and five in Welsh, saved in `outputs/figures/english` and `outputs/figures/welsh`, respectively:

@@ -79,6 +86,7 @@ Versions/batches of data used for previous analysis are listed below.
October 2023 analysis (`/inputs/data/data_202310`):

- EPC: 2023_Q2_complete (preprocessed, and preprocessed and deduplicated)
- EPC & MCS gold merged: batch 231009
- mcs_installations_231009.csv
- mcs_installations_epc_full_231009.csv
- dwellings_2021.xlsx - [Number of dwellings by housing characteristics in England and Wales 2021 (released 30 March 2023)](https://www.ons.gov.uk/peoplepopulationandcommunity/housing/datasets/numberofdwellingsbyhousingcharacteristicsinenglandandwales)
106 changes: 94 additions & 12 deletions asf_welsh_energy_consultation/analysis/produce_plots_and_stats.py
@@ -34,9 +34,11 @@

if __name__ == "__main__":
# ======================================================
# MCS installations, by off-gas status

total_cumulative_installations = process_data.get_total_cumsums()
# Total MCS installations
enhanced_mcs = process_data.get_enhanced_combined(mcs_or_gold="mcs")
total_cumulative_installations = process_data.get_total_cumsums(
data=enhanced_mcs, installation_date_col="commission_date"
)

total_cumulative_installations_chart = time_series_comparison(
data=total_cumulative_installations,
@@ -52,7 +54,10 @@
# MCS installations, by off-gas status

installations_by_gas_status = process_data.cumsums_by_variable(
"off_gas", "Gas status"
"off_gas",
"Gas status",
data=enhanced_mcs,
installation_date_col="commission_date",
)

installations_by_gas_status_chart = time_series_comparison(
@@ -72,7 +77,10 @@
# MCS installations, by rurality

installations_by_rurality = process_data.cumsums_by_variable(
"rurality_2_label", "Rurality"
"rurality_2_label",
"Rurality",
data=enhanced_mcs,
installation_date_col="commission_date",
)

installations_by_rurality_chart = time_series_comparison(
@@ -89,6 +97,70 @@
output_dir=output_folder,
)

# ======================================================

Contributor Author: From here on are the new graphs that use the MCS-EPC gold data.

# Total MCS and EPC installations
enhanced_combined = process_data.get_enhanced_combined(mcs_or_gold="gold")
gold_total_cumulative_installations = process_data.get_total_cumsums(
data=enhanced_combined, installation_date_col="HP_INSTALL_DATE"
)

gold_total_cumulative_installations_chart = time_series_comparison(
data=gold_total_cumulative_installations,
title="Cumulative heat pump installations over time",
y_var="cumsum:Q",
y_title="Number of heat pump installations",
color_var="colour:N",
filename="gold_total_cumulative_installations",
output_dir=output_folder,
)

# ======================================================
# MCS and EPC installations, by off-gas status

gold_installations_by_gas_status = process_data.cumsums_by_variable(
"off_gas",
"Gas status",
data=enhanced_combined,
installation_date_col="HP_INSTALL_DATE",
)

gold_installations_by_gas_status_chart = time_series_comparison(
data=gold_installations_by_gas_status,
title=[
"Cumulative number of heat pump installations in Welsh homes",
"located in off- and on-gas postcodes",
],
y_var="Number of heat pumps:Q",
y_title="Number of heat pump installations",
color_var="Gas status:N",
filename="gold_installations_by_gas_status",
output_dir=output_folder,
)

# ======================================================
# MCS and EPC installations, by rurality

gold_installations_by_rurality = process_data.cumsums_by_variable(
"rurality_2_label",
"Rurality",
data=enhanced_combined,
installation_date_col="HP_INSTALL_DATE",
)

gold_installations_by_rurality_chart = time_series_comparison(
data=gold_installations_by_rurality,
title=[
"Cumulative number of heat pump installations",
"in Welsh homes located in rural vs urban postcodes",
],
y_var="Number of heat pumps:Q",
y_title="Number of heat pump installations",
color_var="Rurality:N",
domain_max=installations_by_rurality.date.max(),
filename="gold_installations_by_rurality",
output_dir=output_folder,
)

# ======================================================
# Proportions of new builds that have heat pumps

@@ -148,7 +220,10 @@

mcs_retrofits = process_data.get_mcs_retrofits()
mcs_retrofit_cumsums = process_data.cumsums_by_variable(
"country", "wales_col", data=mcs_retrofits
"country",
"wales_col",
data=mcs_retrofits,
installation_date_col="commission_date",
)
# this function works without separating by category - 'wales_col' is a whole column of "Wales" (not used)
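
The comment above says `cumsums_by_variable` still works when the category column holds a single value. The repo's implementation is not part of this diff, so the following is only a plain-Python sketch of the general idea, with hypothetical function names and made-up records: grouping by a column that is constant yields exactly one group, whose running total equals the overall running total.

```python
from collections import Counter, defaultdict
from itertools import accumulate


def cumulative_counts(install_dates):
    """Running total of installations, one (date, total) pair per distinct date."""
    per_date = Counter(install_dates)
    ordered = sorted(per_date)  # ISO date strings sort chronologically
    totals = accumulate(per_date[d] for d in ordered)
    return list(zip(ordered, totals))


def cumulative_counts_by_category(records, category_key, date_key):
    """Hypothetical stand-in for a per-category cumulative sum."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[category_key]].append(record[date_key])
    return {cat: cumulative_counts(dates) for cat, dates in grouped.items()}


# Made-up records: the category column is the same value on every row.
records = [
    {"wales_col": "Wales", "commission_date": "2023-02-01"},
    {"wales_col": "Wales", "commission_date": "2023-01-01"},
    {"wales_col": "Wales", "commission_date": "2023-01-01"},
]
by_category = cumulative_counts_by_category(records, "wales_col", "commission_date")
overall = cumulative_counts(r["commission_date"] for r in records)
```

Here `by_category` has a single key, and its series matches `overall`, which is why a constant column can serve as a placeholder category.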

@@ -206,19 +281,24 @@

wales_df = load_wales_df(from_csv=False)
wales_hp = load_wales_hp(wales_df)
wales_mcs = process_data.get_enhanced_mcs()
wales_mcs = process_data.get_enhanced_combined(mcs_or_gold="mcs")

# English plots

# Key statistics
intro = "Summary statistics for heat pumps in Wales\n\n"
total_hp = f"Number of heat pumps: {len(wales_hp)}\n"
total_epc = f"Number of properties in EPC: {len(wales_df)}\n"
hp_perc = "Estimated percentage of properties with a heat pump: \
total_epc_hp = f"Number of heat pumps in EPC: {len(wales_hp)}\n"

Contributor: Is this EPC or EPC + MCS?

Contributor Author: EPC only.

total_epc_properties = f"Number of properties in EPC: {len(wales_df)}\n"
hp_perc = "Estimated percentage of EPC properties with a heat pump: \

Contributor: Again, just EPC or EPC + MCS?

Contributor Author: Just EPC.

{:.2%}\n\n".format(
len(wales_hp) / len(wales_df)
)

total_hp = f"Number of heat pumps in MCS and EPC: {len(enhanced_combined)}\n"
total_mcs_installations = (
f"Number of MCS-certified heat pump installations: {len(enhanced_mcs)}\n"
)

tenure_value_counts = wales_hp.TENURE.value_counts(normalize=True).to_string()

epc_c_or_above_and_good_walls = wales_df.loc[
@@ -262,9 +342,11 @@
stats_txt.writelines(
[
intro,
total_hp,
total_epc,
total_epc_hp,
total_epc_properties,
hp_perc,
total_hp,
total_mcs_installations,
tenure_value_counts,
epc_c_wall,
epc_c_wall_proportion,
91 changes: 90 additions & 1 deletion asf_welsh_energy_consultation/getters/get_data.py
@@ -13,11 +13,16 @@

from asf_core_data.getters.epc.data_batches import get_batch_path
from asf_core_data.config import base_config
from asf_core_data.getters.data_getters import download_core_data, logger
from asf_core_data.getters.data_getters import (
    download_core_data,
    logger,
    download_from_s3,
)

import pandas as pd
import numpy as np
import os
import dask.dataframe as dd

from argparse import ArgumentParser

@@ -71,6 +76,20 @@ def create_argparser():
        type=bool,
    )

    parser.add_argument(
        "--gold_mcs_epc_batch",
        help="Specifies which gold merged EPC-MCS_installation-MCS_installer data batch to use. Only date required in YYMMDD format.",
        type=str,
    )

    parser.add_argument(
        "--download_gold_data_from_s3",
        help="If set to True, downloads the specified batch of gold merged EPC-MCS_installation-MCS_installer data from S3 locally. "
        "Note that this download can take 30 minutes and is not recommended if `hp_installed_gold_[YYMMDD]` is already in the supplementary data folder in `inputs`.",
        default=False,
        type=str,
    )

    return parser
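
A note on the boolean-style arguments above: argparse hands the raw command-line string to whatever callable is given as `type`, so with `type=str` the value `"False"` stays a non-empty (truthy) string, and with `type=bool` it becomes `bool("False")`, which is `True`. A minimal sketch of one common workaround follows, assuming a small converter is acceptable here; `str_to_bool` is a hypothetical helper and not part of this PR or of `asf_core_data`.

```python
from argparse import ArgumentParser, ArgumentTypeError


def str_to_bool(value):
    """Hypothetical helper: map 'True'/'False'-style strings to real booleans."""
    normalised = value.strip().lower()
    if normalised in {"true", "t", "yes", "1"}:
        return True
    if normalised in {"false", "f", "no", "0"}:
        return False
    raise ArgumentTypeError(f"Expected a boolean value, got {value!r}")


parser = ArgumentParser()
parser.add_argument("--download_gold_data_from_s3", default=False, type=str_to_bool)

# The non-string default is used as-is; explicit values go through str_to_bool.
args_default = parser.parse_args([])
args_false = parser.parse_args(["--download_gold_data_from_s3", "False"])
args_true = parser.parse_args(["--download_gold_data_from_s3", "True"])
```

With a converter like this, `--download_gold_data_from_s3 False` parses to a real `False`, so a check such as `if not download_data:` takes the local-file branch as the README describes.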


@@ -554,3 +573,73 @@ def load_wales_hp(wales_epc):
    wales_hp = wales_epc.loc[wales_epc.HP_INSTALLED].reset_index(drop=True)

    return wales_hp


def load_mcs_epc_combined():

Contributor Author: This function is the main part to review.

    """
    Get combined gold MCS-EPC dataset filtered for rows with heat pump installations in domestic dwellings. Use local preprocessed dataset unless specified
    to download data from S3. Downloaded data goes through pre-processing to produce desired pd.DataFrame.

    Returns:
        pd.DataFrame: Gold MCS-EPC dataset for domestic dwellings with heat pumps.
    """
    args = get_args()
    batch = args.gold_mcs_epc_batch
    download_data = args.download_gold_data_from_s3

    if not download_data:
        path = os.path.join(input_data_path, f"hp_installed_gold_{batch}.csv")
        return pd.read_csv(path)

    else:
        path = f"outputs/gold/merged_epc_mcs_installations_installers_{batch}.csv"

        logger.info(f"Loading {path} from S3. This will take a while.")

        download_from_s3(path_to_file=path, output_path=input_data_path)

        ddf = dd.read_csv(
            os.path.join(
                input_data_path,
                f"merged_epc_mcs_installations_installers_{batch}.csv",
            ),
            dtype={
                "HP_INSTALL_DATE": "object",
                "UPRN": "object",
                "installation_type": "object",

Comment on lines +607 to +609

Contributor: Why are you setting these as object?

            },
        )
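
One plausible reason for the explicit `dtype` mapping, which the thread itself does not confirm: `dd.read_csv` infers column dtypes from a sample at the start of the file rather than from every row, so a column that looks numeric in the sample but holds text or missing values further down can fail at compute time unless its dtype is pinned. The toy below is plain Python, not dask; `infer_column_types`, `parse`, and the three data rows are made up purely to illustrate sample-based inference.

```python
import csv
import io


def infer_column_types(rows, sample_size):
    """Guess int vs str per column from only the first `sample_size` rows."""
    types = {}
    for row in rows[:sample_size]:
        for name, value in row.items():
            looks_int = value.lstrip("-").isdigit()
            # A column stays int only while every sampled value looks like an int.
            if types.get(name, int) is int and looks_int:
                types[name] = int
            else:
                types[name] = str
    return types


def parse(rows, types):
    """Convert every value with the type chosen for its column."""
    return [{name: types[name](value) for name, value in row.items()} for row in rows]


# Made-up data: the third UPRN is not numeric, but the sample never sees it.
raw = "UPRN,HP_INSTALL_DATE\n100,2021-01-05\n200,2021-02-10\nunknown,2021-03-15\n"
rows = list(csv.DictReader(io.StringIO(raw)))

sampled_types = infer_column_types(rows, sample_size=2)
try:
    parse(rows, sampled_types)
    sampled_parse_ok = True
except ValueError:
    sampled_parse_ok = False  # int("unknown") fails on the unsampled row

# Pinning the types up front, as the dtype mapping does, avoids the guess.
explicit_types = {"UPRN": str, "HP_INSTALL_DATE": str}
parsed = parse(rows, explicit_types)
```

Reading the date column as text first also fits the later `pd.to_datetime` call, which does the conversion once the data is in pandas.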

        # Get rows with HP installed only, data already filtered for domestic only
        hp_installed = ddf[ddf["HP_INSTALLED"] == True]
        hp_installed = hp_installed[
            [
                "POSTCODE",
                "INSPECTION_DATE",
                "COUNTRY",
                "UPRN",
                "HP_INSTALLED",
                "HP_TYPE",
                "HP_INSTALL_DATE",
                "MCS_AVAILABLE",
                "EPC_AVAILABLE",
            ]
        ]

        hp_installed = hp_installed.rename(columns={"POSTCODE": "postcode"})

        # Convert to pandas df
        df = hp_installed.compute()

        df["HP_INSTALL_DATE"] = pd.to_datetime(df["HP_INSTALL_DATE"])

        # Batch 231009 contains data from MCS up to 30 June 2023 and data from EPC up to 31 July 2023
        # Must remove additional month of EPC data for consistency
        if batch == "231009":
            df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]

Comment on lines +638 to +639

Contributor: Instead, you could just check the max date in EPC and the max date in MCS and take the min of the two maxes :) It would avoid the hardcoding, and this might also happen in future batches.

Suggested change
        if batch == "231009":
            df = df[df["HP_INSTALL_DATE"] < "2023-07-01"]
        max_epc_date = df["HP_INSTALL_DATE"].max()
        max_mcs_date = df["commission_date"].max()
        df = df[df["HP_INSTALL_DATE"] <= min(max_epc_date, max_mcs_date)]

Contributor Author: Great idea, but it doesn't look like a 'commission_date' col exists in the dataset. I believe the processing combines all installation/commission dates into the HP_INSTALL_DATE col.
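
If the merged file has no separate commission date column, one possible way to keep the reviewer's idea is to derive each source's latest date from the source flags. The sketch below assumes, without confirmation from this PR, that `MCS_AVAILABLE` and `EPC_AVAILABLE` are boolean flags marking which source contributed to a row, and that rows flagged for both sources do not distort the per-source maximum. It uses plain Python with made-up rows; `common_cutoff` is a hypothetical name.

```python
from datetime import date


def common_cutoff(rows):
    """Earlier of the latest MCS-flagged and latest EPC-flagged install dates."""
    latest_mcs = max(r["HP_INSTALL_DATE"] for r in rows if r["MCS_AVAILABLE"])
    latest_epc = max(r["HP_INSTALL_DATE"] for r in rows if r["EPC_AVAILABLE"])
    return min(latest_mcs, latest_epc)


# Made-up rows mirroring the batch 231009 situation: EPC runs a month past MCS.
rows = [
    {"HP_INSTALL_DATE": date(2023, 6, 30), "MCS_AVAILABLE": True, "EPC_AVAILABLE": False},
    {"HP_INSTALL_DATE": date(2023, 6, 12), "MCS_AVAILABLE": True, "EPC_AVAILABLE": True},
    {"HP_INSTALL_DATE": date(2023, 7, 31), "MCS_AVAILABLE": False, "EPC_AVAILABLE": True},
]

cutoff = common_cutoff(rows)
# Keep only rows inside the period that both sources cover.
trimmed = [r for r in rows if r["HP_INSTALL_DATE"] <= cutoff]
```

On a pandas DataFrame the same idea would read roughly as `df.loc[df["MCS_AVAILABLE"], "HP_INSTALL_DATE"].max()` for each flag, followed by a `<=` filter on the smaller of the two dates. Whether the flags behave this way in the gold dataset would need checking against the data.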


        df.to_csv(
            os.path.join(input_data_path, f"hp_installed_gold_{batch}.csv"), index=False
        )

        return df