Support for 2 new italian newspapers - Corriere della Sera & Il Giornale #700

ruggsea · 2025-02-04T04:41:07Z

Following up on my previous contribution of La Repubblica, I've added support for two more major Italian newspapers to expand the coverage of Italian media:

Additions:

Created a new parser for Corriere della Sera (www.corriere.it)
Created a new parser for Il Giornale (www.ilgiornale.it)
Updated the IT publisher group with the new newspapers
Set up RSS feeds and sitemap crawling for both sources

What can the parsers extract?

Both parsers can extract:

Article titles
Full article body (including handling of advertisements)
Author information
Publishing dates
Topics

Testing:

Created and passed unit tests for both newspapers
Updated the Supported Sources documentation
All linting and mypy checks pass

This addition brings the total number of supported Italian newspapers to three, providing broader coverage of Italian news sources.

ruggsea · 2025-02-04T04:51:40Z

Apparently Corriere della Sera was already implemented (in d225bc2) while I was working on it too! In any case, my implementation uses many more sitemaps, since I noticed that the "items" sitemaps referenced in https://www.corriere.it/rss/sitemap_v2.xml are very incomplete.

addie9800

Hey, thank you so much for going through the trouble of adding the next (two) publishers. We really appreciate your efforts. Unfortunately, I think you managed to choose the toughest publisher I have seen in a long time, so sorry for the numerous comments (implementing a parser is usually simpler when the publisher actually wraps its text in <p> elements). Also, I think we will not be able to merge this into master in its current state, since we have added the images attribute in the meantime (hence the tests will fail, at least for the documentation and JSON files). So one way or another, you will have to merge those changes into your branch. Honestly, I think the easiest way forward is to copy your additions to a new branch off master.

addie9800 · 2025-02-04T13:13:49Z

src/fundus/publishers/it/corriere_della_sera.py

+)
+
+
+class CorriereDellaSeraParser(ParserProxy):


Comparing your implementation with the existing one, the functionality seems to be very similar. To solve the merge conflicts, I would suggest keeping the existing implementation but extending it with your topics extraction, especially since you seem to have started your branch before the images attribute existed.

addie9800 · 2025-02-04T13:17:10Z

src/fundus/publishers/it/il_giornale.py

+            return []
+
+        @attribute
+        def free_access(self) -> bool:


The same functionality is already inherited from the BaseParser class, so this can probably be removed without changing the behavior.

addie9800 · 2025-02-04T13:20:15Z

src/fundus/publishers/it/il_giornale.py

+    generic_topic_parsing,
+)
+
+logger = logging.getLogger(__name__)


I think this is a leftover.

addie9800 · 2025-02-04T13:29:39Z

src/fundus/publishers/it/il_giornale.py

+        _content_selector = XPath("//div[contains(@class, 'typography--content')]")
+        _subheadline_selector = CSSSelector("div.typography--content h2:not([class])")
+        _summary_selector = CSSSelector(
+            "div.typography--content p.article__abstract, div.typography--content div.article__abstract"


div.typography--content p.article__abstract is too restrictive. In this article, for example: http://ilgiornale.it/news/societ/agricoltura-crisi-costi-stelle-e-rese-calo-mettono-ginocchio-2431181.html the p element containing the summary is not a descendant of the div with class typography--content. It should suffice to select just the p and div elements of class article__abstract.
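To illustrate the point about the summary selector, here is a minimal sketch. It uses the standard library's ElementTree as a stand-in for the lxml selectors in the PR, and the page snippet is a hypothetical, simplified layout, not markup copied from ilgiornale.it. Note that ElementTree compares the class attribute as an exact string, which is stricter than a CSS class selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: the summary sits OUTSIDE the content div.
PAGE = """
<html><body>
  <p class="article__abstract">Summary of the article.</p>
  <div class="typography--content">
    <p>First paragraph of the body.</p>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Restrictive: the summary must be a descendant of the content div.
restrictive = root.findall(
    ".//div[@class='typography--content']//p[@class='article__abstract']"
)

# Relaxed: any p or div whose class is article__abstract, wherever it sits.
relaxed = root.findall(".//p[@class='article__abstract']") + root.findall(
    ".//div[@class='article__abstract']"
)

print(len(restrictive))                       # nothing is selected
print([element.text for element in relaxed])  # the summary is found
```

With this layout the restrictive query selects nothing, while the relaxed one finds the summary paragraph.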

addie9800 · 2025-02-04T14:16:48Z

src/fundus/publishers/it/il_giornale.py

+class IlGiornaleParser(ParserProxy):
+    class V1(BaseParser):
+        # Selectors for article body parts
+        _paragraph_selector = XPath(


I would suggest simplifying this to //div[contains(@class, 'typography--content')]//p[text() or strong or em] | //div[@class='banner banner--spaced-block banner-evo' and (text() or em or strong)], which should lead to the intended result.

addie9800 · 2025-02-04T14:38:25Z

src/fundus/publishers/it/__init__.py

+        },
+        sources=[
+            RSSFeed("https://www.ilgiornale.it/feed.xml"),
+            RSSFeed("https://www.ilgiornale.it/feed/rss.xml"),


This link returns a 404 error
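Dead feed URLs like this can be caught before they are added to a publisher specification. The following is only a sketch using the standard library: the helper names `http_status` and `dead_feeds` are hypothetical and not part of Fundus, and the URLs in the demonstration are placeholders. Some servers reject HEAD requests, so a real check may need to fall back to GET.

```python
from typing import Callable, Iterable, List
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a HEAD request, or 0 if unreachable."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; keep their status code.
        return error.code
    except OSError:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return 0


def dead_feeds(
    urls: Iterable[str], status_of: Callable[[str], int] = http_status
) -> List[str]:
    """Return the URLs that do not answer with HTTP 200."""
    return [url for url in urls if status_of(url) != 200]


# Offline demonstration with a stubbed status lookup (placeholder URLs).
fake_status = {
    "https://example.org/feed.xml": 200,
    "https://example.org/feed/rss.xml": 404,
}
print(dead_feeds(fake_status, status_of=fake_status.get))
```

Passing the status lookup as a parameter keeps the filter testable without network access; calling `dead_feeds(urls)` with the default performs real requests.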

addie9800 · 2025-02-04T14:41:15Z

src/fundus/publishers/it/__init__.py

+            RSSFeed("https://www.corriere.it/dynamic-feed/rss/section/frammenti-di-ferruccio-de-bortoli.xml"),
+            # Main sitemaps
+            Sitemap("https://www.corriere.it/rss/sitemap_v2.xml"),
+            Sitemap("https://www.corriere.it/salute/sitemap-dizionario-corriere-salute.xml"),


This sitemap does not seem to contain any articles, as far as I can see.

The dictionary entries get updated by their Health editors, so I thought we could consider them articles from the Health section.

While I agree that including it in the extracted articles could be beneficial, the current parser does not make those entries parsable.

addie9800 · 2025-02-04T14:46:43Z

src/fundus/publishers/it/__init__.py

+            # Section sitemaps
+            Sitemap("https://www.corriere.it/rss/sitemap/Motori.xml"),
+            Sitemap("https://www.corriere.it/rss/sitemap/Cultura.xml"),
+            Sitemap("https://vivimilano.corriere.it/sitemap_index.xml"),


This one seems to lead to an entirely different subsite, which is not supported by the current parser (and doesn't necessarily need to be).

ruggsea · 2025-02-04T18:32:27Z

Thank you @addie9800 for the thorough review! I think I should have addressed all comments (and rebased off current master and handled the image attribute):

For Il Giornale:

Removed redundant free_access method (already in BaseParser)
Removed unused logger
Simplified selectors as suggested:
- Updated summary selector to work without typography--content restriction
- Simplified paragraph selector to the suggested XPath
Fixed type errors using proper HtmlElement handling
Used the provided article for test generation
Removed the 404 RSS feed, keeping only the working one

For Corriere della Sera:

Kept the existing implementation and added my topics extraction from breadcrumbs
Cleaned up the sitemaps configuration:
- Removed health dictionary sitemap (not parseable)
- Removed vivimilano sitemap (different subpage)
- Kept only working dynamic sitemaps
- Better organized into categories
Let me know what you think!

addie9800

Thank you for addressing this so quickly. I just have a couple comments about the images attribute and then we should be good to go 🚀

addie9800 · 2025-02-04T18:56:54Z

src/fundus/publishers/it/il_giornale.py

+        def body(self) -> Optional[ArticleBody]:
+            # Clean up HTML by removing ads and handling em/strong tags
+            html_string = tostring(self.precomputed.doc).decode("utf-8")
+            html_string = re.sub(r"</?(em|strong)>", "", html_string)


I missed this earlier: apparently there are occasionally also <cite> elements that mess up the extraction, so we need to add them here as well.

Suggested change

- html_string = re.sub(r"</?(em|strong)>", "", html_string)
+ html_string = re.sub(r"</?(em|strong|cite)>", "", html_string)
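As a quick check of the suggested change, the sketch below applies the extended pattern to a made-up HTML fragment. The sample string and the helper name `strip_inline_tags` are assumptions for illustration; only the regular expression comes from the suggestion above. The pattern matches bare tags only, so a tag with attributes (for example `<em class="x">`) would be left in place.

```python
import re

# Pattern from the suggested change: bare opening/closing em, strong, cite tags.
INLINE_TAGS = re.compile(r"</?(em|strong|cite)>")


def strip_inline_tags(html_string: str) -> str:
    """Remove the inline tags but keep the text they wrap."""
    return INLINE_TAGS.sub("", html_string)


# Made-up sample fragment, not taken from a real article.
sample = (
    "<p>Come scrive <cite>Il Giornale</cite>, "
    "il <strong>mercato</strong> <em>cala</em>.</p>"
)
print(strip_inline_tags(sample))
# prints: <p>Come scrive Il Giornale, il mercato cala.</p>
```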

addie9800 · 2025-02-04T18:59:03Z

src/fundus/publishers/it/il_giornale.py

+            images = image_extraction(
+                doc=self.precomputed.doc,
+                paragraph_selector=self._paragraph_selector,
+                image_selector=self._image_selector,


The image selector seems to be a bit too restrictive. The images of this article are not extracted: https://www.ilgiornale.it/news/automotive/davanti-spiccano-solite-note-panda-sandero-mercato-gennaio-2433038.html

addie9800 · 2025-02-04T19:04:49Z

src/fundus/publishers/it/il_giornale.py

+                doc=self.precomputed.doc,
+                paragraph_selector=self._paragraph_selector,
+                image_selector=self._image_selector,
+                author_selector=re.compile(r"(?:Foto:?\s*)?(?P<credits>[^()]+)(?:\s*\([^)]+\))?$"),


The pattern needs to be modified so that it only matches when an attribution is actually being made. This image shows what I mean:

Fundus-Article Cover-Image:
-URL: 'https://img.ilgcdn.com/sites/default/files/styles/xl/public/foto/2025/02/01/1738434667-azs3g51p0mzhzbpw1kqf-ansa.jpeg?_=1738434667&format=webp'
-Description: 'Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri'
-Caption: None
-Authors: ['Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri']
-Versions: [300x169, 500x281, 800x450, 1200x675]

This matters especially when the author attribution is part of the caption, because the utility function removes the match from the caption and saves it as the author. But did you see any case where the publisher actually attributes an image to an author? I searched through their website but didn't find any instances. If you didn't find any either, I think we can safely remove the custom author_selector and add a suitable one later, should we at some point stumble across an article with an image author attribution.

Ok makes sense!
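The behavior described above can be reproduced in isolation. This sketch assumes the pattern is applied to the description text with `re.search`; how Fundus applies it internally is not shown in this thread. The stricter variant is purely hypothetical and only shows one way to require an explicit attribution; the thread settled on removing the custom selector instead.

```python
import re

# Pattern from the diff above: everything except the credits group is optional.
ORIGINAL = re.compile(r"(?:Foto:?\s*)?(?P<credits>[^()]+)(?:\s*\([^)]+\))?$")

# Hypothetical stricter variant: the "Foto" prefix is mandatory.
STRICT = re.compile(r"Foto:?\s*(?P<credits>[^()]+?)\s*(?:\([^)]+\))?$")

caption = "Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri"

# Without parentheses in the text, the credits group swallows the whole caption.
print(ORIGINAL.search(caption).group("credits") == caption)

# The strict variant ignores plain captions and only fires on an attribution.
print(STRICT.search(caption))
print(STRICT.search("Foto: ANSA").group("credits"))
```

Because every part of the original pattern other than the credits group is optional, any text without parentheses is matched in full, which is why the description ends up in the Authors list.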

addie9800 · 2025-02-04T19:10:27Z

src/fundus/publishers/it/il_giornale.py

+            )
+
+            # Try to get cover image from meta tags if no images found
+            if not images:


Ideally, the image_selector should be chosen so that the fallback is unnecessary, which is why I would like to remove the fallback case. The fallback would work for the cover image but not for subsequent images in the article, which makes an error harder to notice and fix (by modifying the image_selector or creating a new parser version; the most likely cause of such an issue is a layout change, which will probably require other modifications as well).

ruggsea · 2025-02-04T19:51:29Z

Thanks for the fast feedback. I believe I have fixed and broadened the image selection for Il Giornale:

Fixed image selector to capture all images (it should now work with the automotive article)
Removed fallback case for cover images
Removed unused author_selector since no attributions were found

Also, now the tags are being included as well!

addie9800

This looks great; thank you so much for adding and improving these two publishers 👍

MaxDall · 2025-02-05T12:02:21Z

@ruggsea Thanks a lot for adding this ❤️ and thanks a lot for the quick and in-depth review @addie9800. I'm gonna merge this now and prepare a new minor release.

addie9800 self-assigned this Feb 4, 2025

addie9800 requested changes Feb 4, 2025

added support for corriere della sera

3ad42c8

ruggsea force-pushed the other-italian-newspapers branch from 005b328 to 3ad42c8 Compare February 4, 2025 18:27

addie9800 requested changes Feb 4, 2025

Fixing image attribute for IlGiornale

0bdd503

addie9800 approved these changes Feb 4, 2025

MaxDall merged commit 7455d33 into flairNLP:master Feb 5, 2025
5 checks passed
