Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update loading functions, fix various issues and complete pipeline #44

Merged
merged 128 commits into from
Mar 20, 2023

Conversation

athrado
Copy link
Contributor

@athrado athrado commented Mar 1, 2023

Dear reviewers,

Thanks you for reviewing and testing this branch of asf-core-data. This PR contains adjustments to the code for many different issues and challenges with the code. The most important are:

  • Option for automatically loading EPC, MCS and other data from S3, and also writing it directly to S3
  • Handling new EPC data batches and all the challenges that come with it (new Scotland data fields etc.)
  • Update and fix loading functions, e.g. loading EPC by country
  • Improve and complete pipeline for processing EPC and MCS data, and merge in the end
  • Notebooks for showcasing how to load data and how to run entire processing pipeline
  • Cosmetic changes

This PR will resolve the following issues: #19, #24, #25, #26, #28, #29 (will be picked up in new issue again), #30, #32, #33, #34, #35, #36, #38, #39, #40.

It would be very much appreciated if the following scripts could be reviewed:

Suggestions for @ch-williamson:

  • asf_core_data/getters/data_download.py: mostly derived from MCS and India's code, just have a read-through
  • asf_core_data/getters/data_getters.py: contains some redunancy, can be tackled when merged with the other branches, for now please just sense-check and maybe check whether doc strings make sense
  • asf_core_data/getters/epc/data_batches.py: first time going through review so please have a look; might be moved up one level in future as it now also includes batch checks for MCS
  • asf_core_data/getters/epc/epc_data.py: quite a few changes from original but mainly regarding paths, arguments, namings etc. As the functionality is tested through the notebook, no super-thorough review needed, just flag if anything strikes you as odd / doesn't make sense.
  • data_loading.ipynb: if there's time, it would be awesome if you could also run this notebook and check whether everything works for you. This will take a while to run, so I suggest during a meeting. :-)

Suggestions for @sofiapinto:

  • asf_core_data/pipeline/data_joining/merge_install_dates.py: has been reviewed in the past so feel free to just look at the changes, or if it's easier look at everything. The diagram I shared on slack might help understand the logic.
  • asf_core_data/pipeline/data_joining/merge_proc_datasets.py: has not been reviewed yet, please sense-check
  • asf_core_data/pipeline/preprocessing/data_cleaning.py: has been reviewed before, only a few functions were updated, so just have a look at them
  • asf_core_data/pipeline/preprocessing/feature_engineering.py: has been reviewed before, only very few changes
  • asf_core_data/pipeline/preprocessing/preprocess_epc_data.py: has been reviewed before, just a couple of typo fixes, kawrgs adjustments and reload data option
  • data_loading.ipynb: if there's time, it would be awesome if you could also run this notebook and check whether everything works for you. This will take a while to run, so I suggest during a meeting. :-)
  • process_new_data.ipynb : feel free to run all of it (please read instructions and notes carefully) or skip the EPC processing and start from generate_and_save_mcs(verbose=True).

Easy wins :)

  • asf_core_data/getters/supplementary_data/deprivation/imd_data.py: few updates, not much code
  • asf_core_data/getters/supplementary_data/geospatial/coordinates.py: few changes, one function only

Note: All MCS scripts are fully reviewed when merging with Chris' branch. My changes to those scripts were mainly path adjustments anyways. The base_config.py file will also be optimised after all the merging.

As for running the code, I would recommend running the two notebooks:

  • data_loading.ipynb : feel free to play around with other settings (other batch, other country subset, other columns)
  • process_new_data.ipynb : feel free to run all of it (please read instructions and notes carefully) or skip the EPC processing and start from generate_and_save_mcs(verbose=True).

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

@athrado athrado changed the title 19 epc from s3 Updates to loading functions and various fixes Mar 20, 2023
@athrado athrado changed the title Updates to loading functions and various fixes Update loading functions, fox various issues and complete pipeline Mar 20, 2023
@athrado athrado merged commit 2a52d82 into dev Mar 20, 2023
@athrado athrado deleted the 19_epc_from_s3 branch March 20, 2023 08:13
@athrado athrado changed the title Update loading functions, fox various issues and complete pipeline Update loading functions, fix various issues and complete pipeline Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants