104 save intermediate gardens #110
base: dev
…dfs in run_calculate_garden_size.py
…e_garden_size.py`
… record as an interim file
…alculate_garden_size.py
…_calculate_garden_size.py
…e_garden_size.py`
# Final round of deduplication
epc_gardens_df = garden_size.deduplicate_df_garden_size(epc_gardens_df)
Note: this final round of deduplication means that the median values created in the loop above for duplicates within interim files will be fed into another median. For any duplicates that remain at this point, the final deduplication will therefore calculate the median of the earlier medians and other individual values. This is not optimal. Do you have a suggestion for how we could improve this?
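To make the concern concrete, here is a minimal sketch of how a two-stage median can differ from a single median over all raw values. It uses only the standard library; `median_per_key`, the key "A" and the values are hypothetical stand-ins, not the behaviour of `deduplicate_df_garden_size`.

```python
from collections import defaultdict
from statistics import median


def median_per_key(records):
    """Collapse (key, value) pairs to one median value per key."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical garden sizes for one property "A", split across two interim files.
interim_1 = [("A", 10.0), ("A", 20.0), ("A", 90.0)]
interim_2 = [("A", 30.0), ("A", 40.0)]

# Two-stage approach: deduplicate each interim file, then deduplicate the results.
stage_1 = list(median_per_key(interim_1).items()) + list(median_per_key(interim_2).items())
two_stage = median_per_key(stage_1)

# Single-stage approach: one median over every raw value.
single_stage = median_per_key(interim_1 + interim_2)

print(two_stage["A"], single_stage["A"])
```

One possible option, memory permitting, would be to skip the in-loop median for keys that might reappear in a later interim file and aggregate them only once at the end. Whether that is affordable with ~20 million rows is an open question.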
Fixes #104
Description
Update the script that calculates garden size to save intermediate results after every 100 file matches, and refactor the deduplication process to be less memory intensive. These changes help the script run to completion without the process being killed; it is very memory intensive because ~20 million gardens are calculated before deduplication.
asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py
Note: this script was run successfully on a g3s.xlarge EC2 instance with a 50GiB volume. The process was killed during deduplication after successfully saving out interim results, so I made edits to the script (which are in the final version here) and ran that part separately.
Instructions for Reviewer
Please could you read through the code and ensure that files are saved out correctly. The main thing to pay attention to is ensuring that there is a) no unintended duplication of data (e.g. saving files [1-100, 1-200] instead of [1-100, 101-200]); and b) no unintended loss of data (e.g. resetting epc_gardens to an empty list when not appropriate, or missing the first or final file matches).
To run:
python -i asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4 -n ews
You can edit and/or test run the script if you like just to check it works, but I obviously wouldn't suggest running the whole thing!
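For reviewers checking the two failure modes above, the intended batching behaviour can be sketched in isolation. This is a simplified illustration, not the code in the script: `save_in_batches` and the `save` callback are hypothetical names, and integers stand in for file matches.

```python
def save_in_batches(matches, batch_size, save):
    """Call save(first, last, batch) once per non-overlapping batch of matches."""
    batch = []
    first = 1  # 1-based index of the first match in the current batch
    for i, match in enumerate(matches, start=1):
        batch.append(match)
        if len(batch) == batch_size:
            save(first, i, batch)
            batch = []  # reset only after the batch has been saved
            first = i + 1
    if batch:
        # Flush the final partial batch so the last matches are not lost.
        save(first, first + len(batch) - 1, batch)


saved = []
save_in_batches(
    range(1, 251),
    batch_size=100,
    save=lambda first, last, batch: saved.append((first, last, list(batch))),
)

# Index ranges of the saved batches.
ranges = [(first, last) for first, last, _ in saved]
# Every match, in the order it was saved.
flattened = [match for _, _, batch in saved for match in batch]
print(ranges)
```

Checking that `flattened` equals the input confirms both properties at once: no match is saved twice and none is dropped.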