-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for 2 new italian newspapers - Corriere della Sera & Il Giornale #700
Conversation
Apparently Corriere della Sera was already implemented (in d225bc2) while I was doing it too! In any case, my implementation has many different sitemaps as I've noticed that the "items" sitemaps referenced in https://www.corriere.it/rss/sitemap_v2.xml are very incomplete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thank you so much for going through the troube of adding the next (two) publishers. We really appreciate your efforts. Unfortunately I think you managed to choose the toughest publisher I have seen in a long time, so sorry for the numerous comments (it actually is simpler to implement a parser usually, if they actually wrap their texts in <p>
elements). Also, I think in this state we will not be able to merge it into master, since we have added the images attribute (hence the tests will fail at least for the documentation and json files). So in one way or another, you would have to merge those changes into your branch. Honestly, I think the easiest way forward is to copy you additions to a new branch off master.
) | ||
|
||
|
||
class CorriereDellaSeraParser(ParserProxy): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing your implementation with the existing one, the functionality seems to be very similar. To solve the merge conflicts, I would suggest keeping the existing implementation, but extending it with your topics extraction. Especially since you seem to have started your branch before the existence of images
return [] | ||
|
||
@attribute | ||
def free_access(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same functionality is already inherited from the BaseParser class, which means that this probably can be removed, while maintaining the same functionality.
generic_topic_parsing, | ||
) | ||
|
||
logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a left-over
_content_selector = XPath("//div[contains(@class, 'typography--content')]") | ||
_subheadline_selector = CSSSelector("div.typography--content h2:not([class])") | ||
_summary_selector = CSSSelector( | ||
"div.typography--content p.article__abstract, div.typography--content div.article__abstract" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
div.typography--content p.article__abstract
is too restrictive. On this article for example: http://ilgiornale.it/news/societ/agricoltura-crisi-costi-stelle-e-rese-calo-mettono-ginocchio-2431181.html the p
element containing the summary is not a descendent of the div
of class typography--content
. It should suffice to just select the p
and div
objects of class article__abstract
class IlGiornaleParser(ParserProxy): | ||
class V1(BaseParser): | ||
# Selectors for article body parts | ||
_paragraph_selector = XPath( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest simplifying this to //div[contains(@class, 'typography--content')]//p[text() or strong or em] | //div[@class='banner banner--spaced-block banner-evo' and (text() or em or strong)]
which shoud lead to the intended result
src/fundus/publishers/it/__init__.py
Outdated
}, | ||
sources=[ | ||
RSSFeed("https://www.ilgiornale.it/feed.xml"), | ||
RSSFeed("https://www.ilgiornale.it/feed/rss.xml"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link returns a 404 error
src/fundus/publishers/it/__init__.py
Outdated
RSSFeed("https://www.corriere.it/dynamic-feed/rss/section/frammenti-di-ferruccio-de-bortoli.xml"), | ||
# Main sitemaps | ||
Sitemap("https://www.corriere.it/rss/sitemap_v2.xml"), | ||
Sitemap("https://www.corriere.it/salute/sitemap-dizionario-corriere-salute.xml"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sitemap seems to not contain any articles, as far as I can see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary entries get update by their Health editors, so I thought we could consider them articles from the Health section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that including it in the extracted articles could be beneficial, the current parser does not make those entries parsable.
src/fundus/publishers/it/__init__.py
Outdated
# Section sitemaps | ||
Sitemap("https://www.corriere.it/rss/sitemap/Motori.xml"), | ||
Sitemap("https://www.corriere.it/rss/sitemap/Cultura.xml"), | ||
Sitemap("https://vivimilano.corriere.it/sitemap_index.xml"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one seems to lead to an entire different subpage, which is not supported by the current parser (and also doesn't necessarily need to)
005b328
to
3ad42c8
Compare
Thank you @addie9800 for the thorough review! I think I should have addressed all comments (and rebased off current master and handled the image attribute): For Il Giornale:
For Corriere della Sera:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing this so quickly. I just have a couple comments about the images attribute and then we should be good to go 🚀
def body(self) -> Optional[ArticleBody]: | ||
# Clean up HTML by removing ads and handling em/strong tags | ||
html_string = tostring(self.precomputed.doc).decode("utf-8") | ||
html_string = re.sub(r"</?(em|strong)>", "", html_string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this earlier, apparently occasionally there are also <cite>
elements, that mess up the extraction, so we would need to add them here as well.
html_string = re.sub(r"</?(em|strong)>", "", html_string) | |
html_string = re.sub(r"</?(em|strong|cite)>", "", html_string) |
images = image_extraction( | ||
doc=self.precomputed.doc, | ||
paragraph_selector=self._paragraph_selector, | ||
image_selector=self._image_selector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The image selector seems to be a bit too restrictive. The images of this article are not extracted: https://www.ilgiornale.it/news/automotive/davanti-spiccano-solite-note-panda-sandero-mercato-gennaio-2433038.html
doc=self.precomputed.doc, | ||
paragraph_selector=self._paragraph_selector, | ||
image_selector=self._image_selector, | ||
author_selector=re.compile(r"(?:Foto:?\s*)?(?P<credits>[^()]+)(?:\s*\([^)]+\))?$"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern needs to be modified to select only if an attribution is being made. What I mean by that can exemplarily be seen in this image:
Fundus-Article Cover-Image:
-URL: 'https://img.ilgcdn.com/sites/default/files/styles/xl/public/foto/2025/02/01/1738434667-azs3g51p0mzhzbpw1kqf-ansa.jpeg?_=1738434667&format=webp'
-Description: 'Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri'
-Caption: None
-Authors: ['Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri']
-Versions: [300x169, 500x281, 800x450, 1200x675]
Especially if the author attribution is done in the caption, the utility function would remove the match from the caption and save it as the author. But did you see any case, where there was an author attribution done by the publisher? I searched through their website but didn't see any instances. If you also didn't find any instance, we can, I think, safely remove the custom author_selector and modify it accordingly, should we at some point stumble across an article image author attribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense!
) | ||
|
||
# Try to get cover image from meta tags if no images found | ||
if not images: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, the image_selector
should be chosen in a way that makes the fallback unnecessary, which is why I would like to remove the fallback case. The reason is that this would work for the cover image but not subsequent images in the article, making it harder to notice the error and fix it (by modifying the image_selector
or creating a new parser version, since the most likely cause of such an issue would be a layout change, which will most likely also require other modifications).
Thanks for the fast feedback. I should have fixed/made more broad the image selection for IlGiornale
Also, now the tags are being included as well! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great; thank you so much for adding and improving these two publishers 👍
@ruggsea Thanks a lot for adding this ❤️ and thanks a lot for the quick and in-depth review @addie9800. I'm gonna merge this now and prepare a new minor release. |
Following up on my previous contribution of La Repubblica, I've added support for two more major Italian newspapers to expand the coverage of Italian media:
Additions:
What can the parsers extract?
Both parsers can extract:
Testing:
This addition brings the total number of supported Italian newspapers to three, providing a broader coverage of Italian news sources.