Use GitHub GraphQL for Metadata fetching (with new metadata fields) #138

Draft
wants to merge 5 commits into
base: master
from

Rabenherz112
Contributor

@Rabenherz112 Rabenherz112 commented May 4, 2024

This is my best attempt at implementing awesome-selfhosted/awesome-selfhosted-data#84, along with some other nice-to-have metadata that I personally would like to see, such as the current release and its release date. This means fetching the metadata through the GraphQL API instead of the Python package used previously.

New metadata preview:

name: Paperless-ngx
website_url: https://docs.paperless-ngx.com/
description: Scan, index, and archive all of your paper documents with an improved interface (fork of Paperless).
licenses:
  - GPL-3.0
platforms:
  - Python
  - Docker
tags:
  - Document Management
source_code_url: https://github.com/paperless-ngx/paperless-ngx
demo_url: https://demo.paperless-ngx.com/
stargazers_count: 17058
updated_at: '2024-05-07'
archived: false
current_release:
  tag: v2.8.1
  published_at: '2024-05-07'
commit_history:
  2024-05: 46

The new changes have been tested by running the full metadata processing on the awesome-selfhosted-data repo (last tested May 5). However, I still have some open questions about how a few things should be done; these are marked with TODO: in the code itself.

Additionally, no tests have been written or modified (since I don't normally code or use Python), so this is still open and would need to be done.

- Created new function add_gh_metadata
    - Use the GitHub GraphQL API to get all GitHub metadata
    - Get all metadata that was already fetched by the old function
    - Get latest release with tag and date
    - Get commit history with commit count (only for the current month)
- Created new function gh_metadata_cleanup
    - Clean up old commit history which is older than 12 months

This code is not tested yet, tbd.
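
For reviewers, the cleanup step described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `prune_commit_history` is hypothetical, it assumes `commit_history` is a mapping with `YYYY-MM` keys as in the preview, and it assumes "older than 12 months" means more than 12 calendar months before the current month.

```python
from datetime import date


def prune_commit_history(history, today, keep_months=12):
    """Return a copy of `history` without entries older than `keep_months`.

    `history` maps 'YYYY-MM' keys to commit counts (assumed layout).
    """
    # Turn the current month into a single running month index so that
    # month differences become plain integer subtraction.
    current = today.year * 12 + (today.month - 1)
    kept = {}
    for key, count in history.items():
        year, month = (int(part) for part in str(key).split("-"))
        age_in_months = current - (year * 12 + (month - 1))
        # Assumption: an entry exactly 12 months old is still kept.
        if age_in_months <= keep_months:
            kept[key] = count
    return kept


if __name__ == "__main__":
    sample = {"2023-04": 3, "2023-05": 10, "2024-05": 46}
    print(prune_commit_history(sample, date(2024, 5, 7)))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the behaviour easy to test.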
- Removed old `get_gh_metadata` function and renamed new function to the same name
- Set GitHub GraphQL API batch size to 60 to avoid API errors
- Fixed an issue where the `isArchived` field was missing from the response
- Added simple error handling for the case where the GitHub metadata could not be fetched
- Fixed duplicated values for `stargazers_count` and `updated_at`
- Fixed date syntax for `current_release/published_at` and `commit_history`
- Re-implemented sleep time for the GitHub API to avoid hitting the rate limit
- Fix gh_metadata_cleanup task
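
To make the batching idea in these commits concrete, here is a minimal sketch of how one GraphQL query for several repositories might be assembled using aliases. It is an assumption about the approach, not the query used in this PR: `build_batch_query` and the `repoN` alias scheme are hypothetical, while the selected fields (`stargazerCount`, `isArchived`, `latestRelease`, and `history(since:)` on the default branch) are fields of GitHub's GraphQL `Repository` and `Commit` objects.

```python
import json

# Fields requested for every repository in the batch.
REPO_FIELDS = """
    url
    stargazerCount
    isArchived
    latestRelease { tagName publishedAt }
    defaultBranchRef {
      target {
        ... on Commit {
          committedDate
          history(since: %(since)s) { totalCount }
        }
      }
    }
"""


def build_batch_query(repos, since):
    """Build one GraphQL query covering every (owner, name) pair in `repos`.

    `since` is an ISO 8601 timestamp; commits after it are counted.
    """
    parts = []
    for index, (owner, name) in enumerate(repos):
        # json.dumps yields a double-quoted, escaped string literal.
        fields = REPO_FIELDS % {"since": json.dumps(since)}
        parts.append(
            "  repo%d: repository(owner: %s, name: %s) {%s  }"
            % (index, json.dumps(owner), json.dumps(name), fields)
        )
    return "query {\n" + "\n".join(parts) + "\n}"


if __name__ == "__main__":
    print(build_batch_query([("paperless-ngx", "paperless-ngx")],
                            "2024-05-01T00:00:00Z"))
```

The resulting string would be sent as the `query` value of a JSON POST to the GraphQL endpoint; the batch size then simply limits how many `(owner, name)` pairs go into a single call.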
@Rabenherz112 Rabenherz112 marked this pull request as draft May 4, 2024 15:09
@Rabenherz112 Rabenherz112 changed the title Use GitHub GraphQL for Metadat fetching (with new metadata fields) Use GitHub GraphQL for Metadata fetching (with new metadata fields) May 4, 2024
@nodiscc nodiscc added enhancement New feature or request help wanted Extra attention is needed labels May 4, 2024
@nodiscc nodiscc added this to the 1.3.0 milestone May 4, 2024
@nodiscc
Owner

nodiscc commented May 5, 2024

Thank you.
I will review this and #133 when I get some time. It might take a while; I will get to it eventually, but I don't know when.

@nodiscc nodiscc removed the help wanted Extra attention is needed label May 5, 2024
@Rabenherz112
Contributor Author

Rabenherz112 commented May 5, 2024

Pending issues have been fixed, and a test has been done by running a full metadata processing on the awesome-selfhosted-data repository.

Some things in the code still have open questions; see comments marked with TODO:.

These changes have been tested by running a full metadata processing on the awesome-selfhosted-data repository and checking that the metadata in each file was assigned to the correct entry.

Bug Fixes:
- Metadata is no longer assigned by index; instead, the `url` field returned by the GraphQL query is matched against the `source_code_url`
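
The URL-based matching could look something like the sketch below. It is a simplified illustration under stated assumptions, not the PR's code: `normalize_repo_url` and `assign_metadata_by_url` are hypothetical helpers, entries are assumed to be dicts with `name` and `source_code_url`, and results are assumed to be dicts shaped like GraphQL repository objects (or `None` when a repository could not be fetched).

```python
def normalize_repo_url(url):
    """Lower-case a repository URL and drop a trailing slash or '.git'."""
    url = url.strip().lower().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url


def assign_metadata_by_url(entries, results):
    """Copy metadata onto entries by matching URLs instead of positions.

    Returns the names of entries for which no result was found.
    """
    # Index results by URL; None marks a repository that failed to load.
    by_url = {normalize_repo_url(r["url"]): r for r in results if r}
    unmatched = []
    for entry in entries:
        result = by_url.get(normalize_repo_url(entry["source_code_url"]))
        if result is None:
            unmatched.append(entry["name"])
            continue
        entry["stargazers_count"] = result["stargazerCount"]
        entry["archived"] = result["isArchived"]
    return unmatched
```

With this approach, results that are missing or arrive in a different order cannot shift metadata onto the wrong entry.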

Logging:
- Added more information about the status of the metadata processing (as it can now take a while to complete)
- Added more debug output for rate-limit information from the GitHub API

Defaults:
- Added a default wait time between API requests to GitHub to avoid hitting the rate limit (default is now 60 seconds, can be configured in the `hecat.yml` file)
- Added a default batch size for the metadata processing (default is now 30, can be configured in the `hecat.yml` file)
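
As a rough illustration of how these two defaults interact, the following sketch splits the URLs into batches and pauses between them. The names `batched`, `process_in_batches`, `fetch`, and `wait_seconds` are hypothetical and chosen for this example only; the actual option name for the wait time in `hecat.yml` is not shown here.

```python
import time


def batched(items, batch_size=30):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(urls, fetch, batch_size=30, wait_seconds=60,
                       sleep=time.sleep):
    """Call `fetch` once per batch, sleeping between consecutive batches.

    `fetch` is a placeholder for whatever performs the API request and
    returns a list of results for the batch.
    """
    results = []
    batches = list(batched(urls, batch_size))
    for number, batch in enumerate(batches, start=1):
        results.extend(fetch(batch))
        # No need to wait after the final batch.
        if number < len(batches):
            sleep(wait_seconds)
    return results
```

Injecting `sleep` as a parameter is a design choice for testability: a test can pass a stub and check the waits without actually pausing.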

Others:
- Added new function `extract_repo_name` to extract the repository name from the `source_code_url`
- Added a try/except block to catch exceptions when writing metadata to a file
- Updated documentation to reflect new batch_size configuration option and new API restrictions from GitHub
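
For context, a helper in the spirit of `extract_repo_name` might be written as below. This is a guess at the behaviour, not the function from this PR: the name `parse_github_repo` is hypothetical, and returning an `(owner, name)` tuple (or `None` for non-GitHub URLs) is an assumption made for the example.

```python
from urllib.parse import urlparse


def parse_github_repo(source_code_url):
    """Return (owner, name) for a GitHub repository URL, or None otherwise."""
    parsed = urlparse(source_code_url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    # Drop empty segments produced by leading or trailing slashes.
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) < 2:
        return None
    owner, name = segments[0], segments[1]
    if name.endswith(".git"):
        name = name[:-4]
    return owner, name


if __name__ == "__main__":
    print(parse_github_repo("https://github.com/paperless-ngx/paperless-ngx"))
```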

Co-authored-by: Le Duc Lischetzke <[email protected]>
@Rabenherz112
Contributor Author

Just a quick ping to remind you about this. 😊
No rush at all, though!

@nodiscc nodiscc self-requested a review November 18, 2024 19:20
Labels
enhancement New feature or request

2 participants