Use GitHub GraphQL for Metadata fetching (with new metadata fields) #138
base: master
- Created new function `add_gh_metadata`
  - Use the GitHub GraphQL API to get all GitHub metadata
  - Get all metadata already fetched by the old function
  - Get latest release with tag and date
  - Get commit history with commit count (only for the current month)
- Created new function `gh_metadata_cleanup`
  - Clean up old commit history which is older than 12 months

This code is not tested yet, tbd.
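The cleanup step described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `prune_commit_history`, the `"YYYY-MM"` key format, and the choice to keep the 12 most recent calendar months (including the current one) are all assumptions made for illustration.

```python
from datetime import date


def prune_commit_history(history, today, keep_months=12):
    """Return a copy of `history` without entries older than `keep_months`.

    `history` is assumed (hypothetically) to map "YYYY-MM" strings to
    commit counts; the real metadata layout may differ.
    """
    # Convert year/month to a single month index so that subtraction
    # works across year boundaries.
    current = today.year * 12 + (today.month - 1)
    pruned = {}
    for key, count in history.items():
        year, month = (int(part) for part in key.split("-"))
        age_in_months = current - (year * 12 + (month - 1))
        if age_in_months < keep_months:
            pruned[key] = count
    return pruned


history = {"2023-04": 3, "2023-05": 5, "2023-06": 7, "2024-05": 1}
recent = prune_commit_history(history, today=date(2024, 5, 15))
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the behavior deterministic, which would make it easier to cover with a unit test later.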
- Removed old `get_gh_metadata` function and renamed new function to the same name
- Set GitHub GraphQL API batch amount to 60 to avoid API errors
- Fixed issue that `isArchived` field did not exist in the response
- Added simple error handling for the case that the GitHub metadata could not be fetched
- Fixed duplicated values for `stargazers_count` and `updated_at`
- Fixed date syntax for `current_release/published_at` and `commit_history`
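To make the batching idea concrete, here is a rough sketch of how several repositories could be combined into one GraphQL request using aliases. It only builds the query string and sends nothing. The helper names `chunked` and `build_batch_query`, the alias scheme `repo0`, `repo1`, ..., and the selected fields are assumptions for illustration and not necessarily what this PR does.

```python
import json

# Fields selected for every repository (an illustrative subset).
REPO_FIELDS = """
    url
    isArchived
    stargazerCount
    updatedAt
    latestRelease { tagName publishedAt }
"""


def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_batch_query(repos):
    """Build one GraphQL query for a list of (owner, name) tuples.

    Each repository gets its own alias so that all of them can be
    requested in a single call.
    """
    parts = []
    for index, (owner, name) in enumerate(repos):
        alias = f"repo{index}"
        # json.dumps produces a correctly quoted and escaped string literal.
        selector = f"repository(owner: {json.dumps(owner)}, name: {json.dumps(name)})"
        parts.append("  " + alias + ": " + selector + " {" + REPO_FIELDS + "  }")
    return "query {\n" + "\n".join(parts) + "\n}"


# Hypothetical repositories, split into batches of 60.
repos = [("example-org", "first-repo"), ("example-org", "second-repo")]
batches = chunked(repos, 60)
query = build_batch_query(batches[0])
```

A smaller batch size means more requests but a smaller, cheaper query per request, which is the trade-off behind making the batch size adjustable.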
- Re-implement sleep time for GitHub API to avoid rate limit
- Fix `gh_metadata_cleanup` task
Thank you.
The pending issues have been fixed, and a test has been done by running a full metadata processing on the awesome-selfhosted-data repository. Some things in the code still have open questions; see comments marked with `TODO:`.
These changes have been tested by running a full metadata processing on the awesome-selfhosted-data repository and checking the metadata files for the correct affiliation.

Bug Fixes:
- Metadata is no longer assigned via an index; instead, the `url` field returned by the GraphQL query is matched to the `source_code_url`

Logging:
- Added more information about the status of the metadata processing (as this can now take a while to process)
- Added more debug information about rate-limit information from the GitHub API

Defaults:
- Added a default wait time between API requests to GitHub to avoid hitting the rate limit (default is now 60 seconds, can be configured in the `hecat.yml` file)
- Added a default batch size for the metadata processing (default is now 30, can be configured in the `hecat.yml` file)

Others:
- Added new function `extract_repo_name` to extract the repo name from the `source_code_url`
- Added try-catch block to catch exceptions when writing metadata to a file
- Updated documentation to reflect the new `batch_size` configuration option and new API restrictions from GitHub

Co-authored-by: Le Duc Lischetzke <[email protected]>
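The URL-based matching described in the bug fix can be sketched as follows. This is an illustration under assumptions, not the code from this PR: the body of `extract_repo_name` is a guess at what such a helper might do, and `normalize_url` and `match_metadata` are hypothetical names.

```python
from urllib.parse import urlparse


def extract_repo_name(source_code_url):
    """Return "owner/name" for a repository URL (illustrative guess)."""
    parts = urlparse(source_code_url).path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a repository URL: {source_code_url}")
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return f"{owner}/{name}"


def normalize_url(url):
    """Lowercase a URL and drop a trailing slash or ".git" suffix."""
    url = url.strip().rstrip("/")
    if url.endswith(".git"):
        url = url[: -len(".git")]
    return url.lower()


def match_metadata(entries, results):
    """Map each entry's `source_code_url` to the result with the same URL.

    Matching by URL rather than by list position stays correct when the
    API omits a repository or returns results in a different order.
    """
    by_url = {normalize_url(r["url"]): r for r in results if r}
    return {
        e["source_code_url"]: by_url.get(normalize_url(e["source_code_url"]))
        for e in entries
    }


# Hypothetical data: results arrive reordered and one lookup failed (None).
entries = [
    {"source_code_url": "https://github.com/Example-Org/first-repo/"},
    {"source_code_url": "https://github.com/example-org/second-repo"},
]
results = [
    {"url": "https://github.com/example-org/second-repo", "isArchived": False},
    None,
    {"url": "https://github.com/example-org/first-repo", "isArchived": True},
]
matched = match_metadata(entries, results)
```

Entries without a matching result map to `None`, so the caller can log and skip them instead of writing metadata that belongs to another project.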
Just a quick ping to remind you about this. 😊
This is my best attempt at implementing awesome-selfhosted/awesome-selfhosted-data#84, along with some other nice-to-have metadata that I personally would like to see, such as the current release and its release date. This means fetching the metadata over the GraphQL API instead of the Python package used previously.
New metadata preview:
The new changes have been tested by running the full metadata processing on the awesome-selfhosted-data repo (last tested May 5). However, I still have some open questions about how we should do some things; these have been marked with `TODO:` in the code itself.

Additionally, no tests have been written or modified (since I don't normally code or use Python), so this is still open and would need to be done.