104 save intermediate gardens #110

Open: wants to merge 12 commits into base: dev

Conversation

crispy-wonton
Collaborator

@crispy-wonton crispy-wonton commented Jan 15, 2025

Fixes #104

Description

Update the script that calculates garden size so that it saves intermediate results after every 100 file matches, and refactor the deduplication process to be less memory intensive. These changes help the script run to completion without the process being killed: it is very memory intensive, as ~20 million gardens are calculated before deduplication.

  • modified asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py.
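
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the save-every-100-matches pattern described above. It is not the code in run_calculate_garden_size.py: `process_in_batches`, `compute` and `save` are hypothetical names, and the real script may structure this differently.

```python
from typing import Callable, Iterable, List


def process_in_batches(
    file_matches: Iterable[str],
    compute: Callable[[str], list],
    save: Callable[[int, list], None],
    batch_size: int = 100,
) -> int:
    """Compute results per file match and save them every `batch_size` matches.

    `compute` and `save` are hypothetical callables standing in for the
    garden size calculation and the interim file write.
    Returns the number of batches saved.
    """
    buffer: List = []
    pending = 0  # file matches processed since the last save
    batch_index = 0
    for match in file_matches:
        buffer.extend(compute(match))
        pending += 1
        if pending == batch_size:
            save(batch_index, buffer)
            batch_index += 1
            buffer = []  # reset only after the batch has been saved
            pending = 0
    if pending:  # flush the final partial batch so no matches are lost
        save(batch_index, buffer)
        batch_index += 1
    return batch_index
```

Only one batch of results is held in memory at a time, which is the point of the change; the final flush is what stops the last, partial batch from being dropped.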

Note: this script was run successfully on a g3s.xlarge EC2 instance with a 50 GiB volume. On that run the process was killed during deduplication, after the interim results had been saved successfully, so I edited the deduplication step (those edits are in the final version here) and ran that part separately.

Instructions for Reviewer

Please could you read through the code and check that files are saved out correctly. The main things to check are that there is a) no unintended duplication of data (e.g. saving files [1-100, 1-200] instead of [1-100, 101-200]); and b) no unintended loss of data (e.g. resetting epc_gardens to an empty list when not appropriate, or missing the first or final file matches).
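
One way to test both conditions on a small run is to compare what ended up in the interim files against what was expected. The sketch below is an assumption about how a reviewer might do that, not part of the PR; `check_batches` is a hypothetical helper that works on any hashable identifiers (for example, file match indices).

```python
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple


def check_batches(
    expected: Iterable[Hashable],
    batches: Iterable[Iterable[Hashable]],
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return (duplicated, missing) items across saved batches.

    duplicated: items that appear more than once across all batches.
    missing: expected items that appear in no batch.
    """
    counts = Counter(item for batch in batches for item in batch)
    duplicated = {item for item, n in counts.items() if n > 1}
    missing = set(expected) - set(counts)
    return duplicated, missing


# Correct batching, [1-100, 101-200]: nothing duplicated, nothing missing.
good = [range(1, 101), range(101, 201)]
# Buggy batching, [1-100, 1-200]: the first 100 items are saved twice.
bad = [range(1, 101), range(1, 201)]
```

Both returned sets should be empty for a correct run; a non-empty `duplicated` set points at case a) and a non-empty `missing` set at case b).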

To run:
python -i asf_heat_pump_suitability/pipeline/run_scripts/run_calculate_garden_size.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4 -n ews

You can edit and/or test run the script if you like just to check it works, but I obviously wouldn't suggest running the whole thing!

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@crispy-wonton crispy-wonton marked this pull request as ready for review January 15, 2025 11:48
Comment on lines +230 to 231
# Final round of deduplication
epc_gardens_df = garden_size.deduplicate_df_garden_size(epc_gardens_df)
Collaborator Author


Note: this final round of deduplication means that the medians calculated in the loop above for duplicates within interim files will feed into another median. So for any duplicates that remain at this point, the final deduplication calculates the median of those interim medians and any other individual values, which is not in general equal to the median of the original values. This is not optimal. Do you have a suggestion for how we could improve this?
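
To make the concern concrete, here is a small sketch comparing the two-stage median (median per interim file, then median of those) with a single median over all original values. It is illustrative only and does not use `garden_size.deduplicate_df_garden_size`; the function names, the key `"a"` and the values are hypothetical. The single-stage version is one possible direction, under the assumption that keeping raw duplicate values until the end fits in memory, which this PR does not establish.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Hashable, List, Tuple

Batch = List[Tuple[Hashable, float]]


def two_stage_median(batches: List[Batch]) -> Dict[Hashable, float]:
    """Median within each batch, then median of the per-batch results."""
    interim: Batch = []
    for batch in batches:
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        interim.extend((key, median(values)) for key, values in grouped.items())
    final = defaultdict(list)
    for key, value in interim:
        final[key].append(value)
    return {key: median(values) for key, values in final.items()}


def single_stage_median(batches: List[Batch]) -> Dict[Hashable, float]:
    """One median per key over all original values from every batch."""
    grouped = defaultdict(list)
    for batch in batches:
        for key, value in batch:
            grouped[key].append(value)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical data: one key with three values in the first interim file
# and one value in the second.
batches = [
    [("a", 1.0), ("a", 2.0), ("a", 9.0)],
    [("a", 10.0)],
]
print(two_stage_median(batches))     # {'a': 6.0}
print(single_stage_median(batches))  # {'a': 5.5}
```

The two results differ because the first interim median (2.0) hides how many values it summarises. Deferring the median to a single final pass avoids this, at the cost of carrying every duplicate row through to the end.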

Development

Successfully merging this pull request may close these issues.

Save intermediate results in garden size pipeline