104 save intermediate gardens #110

crispy-wonton · 2025-01-15T11:46:25Z

Fixes #104

Description

Update the script that calculates garden size to save intermediate results after every 100 file matches, and refactor the deduplication process to be less memory intensive. The script is very memory intensive because ~20 million gardens are calculated before deduplication; these changes are intended to let it run to completion without the process being killed.

modified asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py.

Note: I ran this script on a g3s.xlarge EC2 instance with a 50 GiB volume. The interim results were saved out successfully, but the process was then killed during deduplication, so I made edits to the script (which are in the final version here) and ran the deduplication part separately.

Instructions for Reviewer

Please could you read through the code and check that files are saved out correctly. The main things to look for are: a) no unintended duplication of data (e.g. saving files [1-100, 1-200] instead of [1-100, 101-200]); and b) no unintended loss of data (e.g. resetting epc_gardens to an empty list when not appropriate, or missing the first or final file matches).
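The two failure modes above can be checked against a small stand-alone sketch. This is not the code in this PR: `process_in_batches` is a hypothetical helper, and `save_batch` stands in for whatever writes an interim file. Assuming results are flushed every 100 file matches, the saved batches should be disjoint, contiguous, and include the final partial batch:

```python
from typing import Callable, Iterable, List, Tuple


def process_in_batches(
    items: Iterable[str],
    save_batch: Callable[[int, int, List[str]], None],
    batch_size: int = 100,
) -> None:
    """Accumulate per-file results and flush them every `batch_size` items.

    `save_batch(first, last, results)` is a hypothetical callback; `first` and
    `last` are the 1-based, inclusive positions of the items in the batch.
    """
    buffer: List[str] = []
    first = 1  # 1-based position of the first item in the current buffer
    count = 0
    for count, item in enumerate(items, start=1):
        buffer.append(item)  # stand-in for the per-file garden calculation
        if len(buffer) == batch_size:
            save_batch(first, count, buffer)
            buffer = []  # reset only after the batch has been saved
            first = count + 1
    if buffer:  # flush the final, partial batch so trailing items are not lost
        save_batch(first, count, buffer)


saved: List[Tuple[int, int, int]] = []
process_in_batches(
    (f"file_{i}" for i in range(1, 251)),
    lambda first, last, results: saved.append((first, last, len(results))),
)
print(saved)  # [(1, 100, 100), (101, 200, 100), (201, 250, 50)]
```

With 250 inputs the saved ranges are 1-100, 101-200 and 201-250, so nothing is written twice and the last 50 items are not dropped.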

To run:
python -i asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4 -n ews

You can edit and/or test run the script if you like just to check it works, but I obviously wouldn't suggest running the whole thing!

Checklist:


crispy-wonton · 2025-01-16T10:37:11Z

asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py

+    # Final round of deduplication
    epc_gardens_df = garden_size.deduplicate_df_garden_size(epc_gardens_df)


Note: this final round of deduplication takes the medians already computed in the loop above (for duplicates within each interim file) and feeds them into another median. For any duplicates that remain at this point, the final value is therefore the median of per-file medians and other individual values, not the median of all the original values. This is not optimal. Do you have a suggestion for how we could improve this?
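To make the concern concrete, here is a minimal sketch using only the standard library. The numbers are made up for illustration and the variable names are hypothetical; it does not call `deduplicate_df_garden_size`. It compares the two-stage result with the median of all raw values for a single property:

```python
from statistics import median

# Hypothetical garden size estimates for one property (made-up values).
file_a = [10.0, 12.0, 14.0]  # three duplicates inside one interim file
file_b = [40.0]              # a single estimate in a later interim file

# Stage 1: duplicates within an interim file are collapsed to their median.
per_file_median = median(file_a)  # 12.0

# Stage 2: the final deduplication sees that median plus the other value.
final_value = median([per_file_median] + file_b)

# For comparison: the median over all four original estimates.
overall_median = median(file_a + file_b)

print(final_value)     # 26.0
print(overall_median)  # 13.0
```

Unlike a sum or a count, a median cannot be recombined exactly from per-batch summaries, which is why the two results differ here.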

crispy-wonton added 12 commits January 9, 2025 16:43

save intermediate garden estimates from garden pipeline to s3

960540e

add 'gardens' subdir and save garden estimate files to it

2feb6da

convert nationalcadastralref column to string type for concatenating dfs in run_calculate_garden_size.py

35d709f

add script to concatenate interim garden size results `run_concatenate_garden_size.py`

43914ee

change interim file saving so that results are saved in batches of 100

991e0d4

update file save_as name with low end of file range in run_calculate_garden_size.py

0804de3

update run_calculate_garden_size.py to load and concat saved interim files

b1dcb7f

update run_calculate_garden_size.py so that it doesn't save the first record as an interim file

dad93bf

delete run_concatenate_garden_size.py because code is combined into calculate_garden_size.py

47443a4

deduplicate garden size estimates in less memory intensive way in run_calculate_garden_size.py

f94de07

Merge branch 'dev' into 104_save_intermediate_gardens

39ac0b2

update interim file save_as name to be more accurate in `run_calculate_garden_size.py`

e88e5da

crispy-wonton marked this pull request as ready for review January 15, 2025 11:48
