Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publisher Coverage Script and Badge #324

Merged
merged 12 commits into from
Nov 6, 2023
Merged

Publisher Coverage Script and Badge #324

merged 12 commits into from
Nov 6, 2023

Conversation

dobbersc
Copy link
Collaborator

This PR adds a script to execute a baseline real-time crawler validation. A problem with the current crawler unit tests is that the crawlers are only validated on pre-fetched articles. This script aims to check a publisher's availability and basic crawl results.
The tests include a real-time crawl for each publisher's news map and RSS Feed checking the received articles for attribute completeness. Note that this script does not check the attributes' correctness, only their presence. The resulting publisher coverage is then summarised in the output report.

To automate the publisher coverage script this PR includes an action that runs every day at 01:00 generating a badge containing the publisher coverage displayed in the README. See here for a current action run.

To enhance the functionality we could also run the tests for each source independently, e.g. news map and RSS feed if specified, such that a crawler only passes the tests if all specified sources return appropriate articles.

@dobbersc dobbersc requested a review from MaxDall October 24, 2023 20:46
@MaxDall
Copy link
Collaborator

MaxDall commented Nov 2, 2023

@dobbersc Thanks for adding this, that's exactly what we needed.

for publisher in sorted(
publisher_region, key=lambda p: cast(PublisherEnum, p).name # type: ignore[no-any-return]
):
publisher_name: str = publisher.name # type: ignore[attr-defined]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole construct seems unnecessary to me, also my local mypy installation runs with no error when removing publisher_name and use publisher.name again, as you did before 6f4f114.

I was running mypy 1.1.1|1.6.1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with you. On my local machine, this part is also not necessary. Although, for some reason, the CI fails without this change (CI Log). Just to be sure I'll try it again without. The enum typing is unfortunately a bit messy :/.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very weird. I'm running the same version of mypy on a complete fresh conda installation of fundus (cleared cache etc.) and get no complaints locally from mypy while the CI's mypy fails.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing this. That is indeed very weird.

The enum typing is unfortunately a bit messy :/

I agree. I will add something about this to #325

Do you have any clue what's causing this discrepancy between CI and local runs? My next suggestion would be to remove the mypy version constraints from pyproject.toml and see if this is a version-specific problem or some problem with the CI in general. What do you think?

@dobbersc dobbersc merged commit 3d82fb8 into master Nov 6, 2023
@dobbersc dobbersc deleted the publisher-coverage branch November 6, 2023 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants