[Le Temps | rebuilt consolidated] Remove duplicate issues across GDL and JDG 1992-1998 #142

e-maud · 2025-01-15T10:32:40Z

The contents of GDL and JDG are similar from 1991 until 1998, and the segmentation of JDG is better than that of GDL for this period.

When producing the canonical and rebuilt data, the GDL years 1992 through 1998 (inclusive) could thus be ignored.

The merger of the two newspapers happened sometime in 1991; from what I found, on 02 Sep 1991 (cf. Slack message, but it would be good to double-check).

Currently this exclusion is made at Solr indexing time, but it would be good to have it at the beginning of the pipeline.

piconti · 2025-01-15T15:48:42Z

GDL and JDG data will not be re-ingested for the March release, so if I understand correctly, you suggest deleting the GDL data for this period (1992-1998)?
As well as forcing the ingestion process to ignore these years?

e-maud · 2025-01-15T18:08:58Z

I overlooked the fact that it won't be re-ingested. Yes, I believe the GDL issues from 1991-09-01 to 1998 should be excluded from the rebuild. This will prevent these items from being parsed twice in downstream processes and eliminate the need for filter patches during Solr ingestion.
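A minimal sketch of what such an early exclusion could look like, assuming issues are identified by a newspaper alias and an issue date. The names `EXCLUDED_RANGES`, `is_excluded` and `filter_issues` are hypothetical and not part of the existing pipeline. The start date is the one quoted above (still to be double-checked), and the end date assumes "to 1998" means through 31 Dec 1998.

```python
from datetime import date

# Hypothetical exclusion table: alias -> (first excluded day, last excluded day).
# Start date taken from the comment above; end date is an assumption
# ("to 1998" read as inclusive of the whole year).
EXCLUDED_RANGES = {
    "GDL": (date(1991, 9, 1), date(1998, 12, 31)),
}


def is_excluded(alias: str, issue_date: date) -> bool:
    """Return True if this issue falls in an excluded range for its alias."""
    date_range = EXCLUDED_RANGES.get(alias)
    if date_range is None:
        return False
    start, end = date_range
    return start <= issue_date <= end


def filter_issues(issues):
    """Keep only (alias, issue_date) pairs that are not excluded."""
    return [(alias, d) for alias, d in issues if not is_excluded(alias, d)]
```

Applying a filter like this where issues are first selected would mean downstream steps (rebuilt, Solr ingestion) never see the duplicated GDL issues, so no later patch is needed.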

piconti · 2025-01-16T09:48:12Z

Noted, I can change the years listed in the roadmap so that I don't compute the rebuilt for the years we don't want.
But I still think we should also delete the canonical data for these years directly, so that the manifests are coherent.

piconti · 2025-01-16T09:54:28Z

Just realized that this could be challenging to handle with manifests as they currently work (I created an issue about it).
However, if the Solr ingestion manifest directly excludes them, we could have the right number right away without modifying the logic of the manifest computation too much.
I think it would still be important to add this option, though.

simon-clematide · 2025-01-16T13:03:46Z

I would just remove the canonical data for these years from the canonical folder. We could still keep them somewhere. I would rather not patch manifests. For me, the manifest should follow the principle "what you see on S3 is what you get". Otherwise we can easily get into trouble.

piconti · 2025-01-16T13:26:39Z

Yes, but if we delete the canonical data on S3, then what's currently in the manifest won't match anymore, and currently manifests are only additive.
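To illustrate the mismatch, here is a rough sketch that compares the keys a manifest lists with the keys actually present in a bucket. `manifest_drift` is a hypothetical helper, and both inputs are assumed to be plain collections of object keys obtained elsewhere; this is not how the manifests are computed today.

```python
def manifest_drift(manifest_keys, s3_keys):
    """Compare keys recorded in a manifest with keys actually present on S3.

    Both arguments are iterables of object-key strings (an assumption made
    for this sketch; the real manifest format may differ).
    """
    manifest_keys = set(manifest_keys)
    s3_keys = set(s3_keys)
    return {
        # Listed in the manifest but deleted (or never written) on S3.
        "missing_on_s3": sorted(manifest_keys - s3_keys),
        # Present on S3 but not recorded in the manifest.
        "untracked_on_s3": sorted(s3_keys - manifest_keys),
    }
```

With additive-only manifests, deleting the GDL canonical data would leave every deleted key under `missing_on_s3` until the manifest logic supports removals.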

e-maud added the "data" label (issues that are related to the data) Jan 15, 2025

e-maud assigned piconti Jan 15, 2025
