Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for 2 new italian newspapers - Corriere della Sera & Il Giornale #700

Merged
merged 2 commits into from
Feb 5, 2025

Conversation

ruggsea
Copy link
Contributor

@ruggsea ruggsea commented Feb 4, 2025

Following up on my previous contribution of La Repubblica, I've added support for two more major Italian newspapers to expand the coverage of Italian media:

Additions:

  • Created a new parser for Corriere della Sera (www.corriere.it)
  • Created a new parser for Il Giornale (www.ilgiornale.it)
  • Updated the IT publisher group with the new newspapers
  • Set up RSS feeds and sitemap crawling for both sources

What can the parsers extract?

Both parsers can extract:

  • Article titles
  • Full article body (including handling of advertisements)
  • Author information
  • Publishing dates
  • Topics

Testing:

  • Created and passed unit tests for both newspapers
  • Updated the Supported Sources documentation
  • All linting and mypy checks pass

This addition brings the total number of supported Italian newspapers to three, providing a broader coverage of Italian news sources.

@ruggsea
Copy link
Contributor Author

ruggsea commented Feb 4, 2025

Apparently Corriere della Sera was already implemented (in d225bc2) while I was doing it too! In any case, my implementation has many different sitemaps as I've noticed that the "items" sitemaps referenced in https://www.corriere.it/rss/sitemap_v2.xml are very incomplete

@addie9800 addie9800 self-assigned this Feb 4, 2025
Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thank you so much for going through the troube of adding the next (two) publishers. We really appreciate your efforts. Unfortunately I think you managed to choose the toughest publisher I have seen in a long time, so sorry for the numerous comments (it actually is simpler to implement a parser usually, if they actually wrap their texts in <p> elements). Also, I think in this state we will not be able to merge it into master, since we have added the images attribute (hence the tests will fail at least for the documentation and json files). So in one way or another, you would have to merge those changes into your branch. Honestly, I think the easiest way forward is to copy you additions to a new branch off master.

)


class CorriereDellaSeraParser(ParserProxy):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing your implementation with the existing one, the functionality seems to be very similar. To solve the merge conflicts, I would suggest keeping the existing implementation, but extending it with your topics extraction. Especially since you seem to have started your branch before the existence of images

return []

@attribute
def free_access(self) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same functionality is already inherited from the BaseParser class, which means that this probably can be removed, while maintaining the same functionality.

generic_topic_parsing,
)

logger = logging.getLogger(__name__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a left-over

_content_selector = XPath("//div[contains(@class, 'typography--content')]")
_subheadline_selector = CSSSelector("div.typography--content h2:not([class])")
_summary_selector = CSSSelector(
"div.typography--content p.article__abstract, div.typography--content div.article__abstract"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

div.typography--content p.article__abstract is too restrictive. On this article for example: http://ilgiornale.it/news/societ/agricoltura-crisi-costi-stelle-e-rese-calo-mettono-ginocchio-2431181.html the p element containing the summary is not a descendent of the div of class typography--content. It should suffice to just select the p and div objects of class article__abstract

class IlGiornaleParser(ParserProxy):
class V1(BaseParser):
# Selectors for article body parts
_paragraph_selector = XPath(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest simplifying this to //div[contains(@class, 'typography--content')]//p[text() or strong or em] | //div[@class='banner banner--spaced-block banner-evo' and (text() or em or strong)] which shoud lead to the intended result

},
sources=[
RSSFeed("https://www.ilgiornale.it/feed.xml"),
RSSFeed("https://www.ilgiornale.it/feed/rss.xml"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link returns a 404 error

src/fundus/publishers/it/__init__.py Outdated Show resolved Hide resolved
RSSFeed("https://www.corriere.it/dynamic-feed/rss/section/frammenti-di-ferruccio-de-bortoli.xml"),
# Main sitemaps
Sitemap("https://www.corriere.it/rss/sitemap_v2.xml"),
Sitemap("https://www.corriere.it/salute/sitemap-dizionario-corriere-salute.xml"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sitemap seems to not contain any articles, as far as I can see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dictionary entries get update by their Health editors, so I thought we could consider them articles from the Health section

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that including it in the extracted articles could be beneficial, the current parser does not make those entries parsable.

# Section sitemaps
Sitemap("https://www.corriere.it/rss/sitemap/Motori.xml"),
Sitemap("https://www.corriere.it/rss/sitemap/Cultura.xml"),
Sitemap("https://vivimilano.corriere.it/sitemap_index.xml"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems to lead to an entire different subpage, which is not supported by the current parser (and also doesn't necessarily need to)

src/fundus/publishers/it/__init__.py Outdated Show resolved Hide resolved
@ruggsea ruggsea force-pushed the other-italian-newspapers branch from 005b328 to 3ad42c8 Compare February 4, 2025 18:27
@ruggsea
Copy link
Contributor Author

ruggsea commented Feb 4, 2025

Thank you @addie9800 for the thorough review! I think I should have addressed all comments (and rebased off current master and handled the image attribute):

For Il Giornale:

  • Removed redundant free_access method (already in BaseParser)
  • Removed unused logger
  • Simplified selectors as suggested:
    • Updated summary selector to work without typography--content restriction
    • Simplified paragraph selector to the suggested XPath
  • Fixed type errors using proper HtmlElement handling
  • Used the article provided for test genearion
  • Removed the 404 RSS feed, keeping only the working one

For Corriere della Sera:

  • Kept the existing implementation and added my topics extraction from breadcrumbs

  • Cleaned up the sitemaps configuration:

    • Removed health dictionary sitemap (not parseable)
    • Removed vivimilano sitemap (different subpage)
    • Kept only working dynamic sitemaps
    • Better organized into categories

    Let me know what you think!

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing this so quickly. I just have a couple comments about the images attribute and then we should be good to go 🚀

def body(self) -> Optional[ArticleBody]:
# Clean up HTML by removing ads and handling em/strong tags
html_string = tostring(self.precomputed.doc).decode("utf-8")
html_string = re.sub(r"</?(em|strong)>", "", html_string)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this earlier, apparently occasionally there are also <cite> elements, that mess up the extraction, so we would need to add them here as well.

Suggested change
html_string = re.sub(r"</?(em|strong)>", "", html_string)
html_string = re.sub(r"</?(em|strong|cite)>", "", html_string)

images = image_extraction(
doc=self.precomputed.doc,
paragraph_selector=self._paragraph_selector,
image_selector=self._image_selector,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image selector seems to be a bit too restrictive. The images of this article are not extracted: https://www.ilgiornale.it/news/automotive/davanti-spiccano-solite-note-panda-sandero-mercato-gennaio-2433038.html

doc=self.precomputed.doc,
paragraph_selector=self._paragraph_selector,
image_selector=self._image_selector,
author_selector=re.compile(r"(?:Foto:?\s*)?(?P<credits>[^()]+)(?:\s*\([^)]+\))?$"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern needs to be modified to select only if an attribution is being made. What I mean by that can exemplarily be seen in this image:

Fundus-Article Cover-Image:
-URL:			 'https://img.ilgcdn.com/sites/default/files/styles/xl/public/foto/2025/02/01/1738434667-azs3g51p0mzhzbpw1kqf-ansa.jpeg?_=1738434667&format=webp'
-Description:	 'Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri'
-Caption:		 None
-Authors:		 ['Si fingono sordomute per derubarli. Ma le vittime sono due carabinieri']
-Versions:		 [300x169, 500x281, 800x450, 1200x675]

Especially if the author attribution is done in the caption, the utility function would remove the match from the caption and save it as the author. But did you see any case, where there was an author attribution done by the publisher? I searched through their website but didn't see any instances. If you also didn't find any instance, we can, I think, safely remove the custom author_selector and modify it accordingly, should we at some point stumble across an article image author attribution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense!

)

# Try to get cover image from meta tags if no images found
if not images:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the image_selector should be chosen in a way that makes the fallback unnecessary, which is why I would like to remove the fallback case. The reason is that this would work for the cover image but not subsequent images in the article, making it harder to notice the error and fix it (by modifying the image_selector or creating a new parser version, since the most likely cause of such an issue would be a layout change, which will most likely also require other modifications).

@ruggsea
Copy link
Contributor Author

ruggsea commented Feb 4, 2025

Thanks for the fast feedback. I should have fixed/made more broad the image selection for IlGiornale

  • Fixed image selector to capture all images (it should now work with the automotive article)
  • Removed fallback case for cover images
  • Removed unused author_selector since no attributions found

Also, now the tags are being included as well!

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great; thank you so much for adding and improving these two publishers 👍

@MaxDall
Copy link
Collaborator

MaxDall commented Feb 5, 2025

@ruggsea Thanks a lot for adding this ❤️ and thanks a lot for the quick and in-depth review @addie9800. I'm gonna merge this now and prepare a new minor release.

@MaxDall MaxDall merged commit 7455d33 into flairNLP:master Feb 5, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants