[Le Temps | rebuilt consolidated] Remove duplicate issues across GDL and JDG 1992-1998 #142

Open
e-maud opened this issue Jan 15, 2025 · 6 comments

Comments

@e-maud
Member

e-maud commented Jan 15, 2025

The contents of GDL and JDG are similar from 1991 until 1998, with the segmentation of JDG being better than GDL for this period.

When producing the canonical and rebuilt data, the GDL years from 1992 (included) to 1998 could thus be ignored.

The merger of the two newspapers happened at some point in 1991; from what I found, on 02 Sep 1991 (cf. Slack message, but it would be good to double-check).

Currently this exclusion is made at Solr indexing time, but it would be good to have it at the beginning of the pipeline.
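
As an illustration of what an exclusion at the start of the pipeline could look like, here is a minimal sketch. It assumes each issue can be reduced to a (title alias, date) pair; the names `EXCLUDED_RANGES`, `is_excluded` and `filter_issues` are hypothetical and not part of the existing code. The range is the 1992-1998 one mentioned above and is kept in a table so the start date can be moved once the exact merger date is confirmed.

```python
from datetime import date

# Hypothetical exclusion table: title alias -> (first excluded day, last excluded day).
# The 1992-1998 range comes from this issue; the exact start date is still to be confirmed.
EXCLUDED_RANGES = {
    "GDL": (date(1992, 1, 1), date(1998, 12, 31)),
}


def is_excluded(alias, issue_date, ranges=EXCLUDED_RANGES):
    """Return True if the issue falls inside an excluded range for its title."""
    if alias not in ranges:
        return False
    start, end = ranges[alias]
    return start <= issue_date <= end


def filter_issues(issues, ranges=EXCLUDED_RANGES):
    """Keep only the (alias, issue_date) pairs that are not excluded."""
    return [(a, d) for a, d in issues if not is_excluded(a, d, ranges)]


# Example input: only the GDL issue dated inside the range is dropped.
issues = [
    ("GDL", date(1991, 12, 31)),
    ("GDL", date(1992, 1, 1)),
    ("JDG", date(1992, 1, 1)),
    ("GDL", date(1999, 1, 1)),
]
kept = filter_issues(issues)
```

Applying the same predicate at every stage (canonical, rebuilt, Solr) would avoid having a separate filter patch at indexing time.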

@e-maud e-maud added the data issues that are related to the data label Jan 15, 2025
@piconti
Member

piconti commented Jan 15, 2025

GDL and JDG data will not be re-ingested for the March release, so if I understand correctly, you suggest deleting the GDL data for this period (1992-1998)?
As well as forcing the ingestion process to ignore these years?

@e-maud
Member Author

e-maud commented Jan 15, 2025

I overlooked the fact that it won't be re-ingested. Yes, I believe the GDL issues from 1991-09-01 to 1998 should be excluded from the rebuild. This will prevent these items from being parsed twice in downstream processes and eliminate the need for filter patches during Solr ingestion.

@piconti
Member

piconti commented Jan 16, 2025

Noted, I can change the years listed in the roadmap, so that I don't compute the rebuilt for the years we don't want.
But I still think we should also directly delete the canonical data for these years, so that the manifests are coherent.

@piconti
Member

piconti commented Jan 16, 2025

Just thought about the fact that this could be challenging to handle with manifests as they currently work (created an issue about it).
However, if the Solr ingestion manifest directly excludes them, we could have the right number right away without modifying the logic of the manifest computation too much.
I think it would still be important to add this option though.

@simon-clematide

I would just remove the canonical data for these years from the canonical folder. We could still keep them somewhere. I would rather not patch manifests. For me, the manifest should be "what you see on S3 is what you get". Otherwise we can easily get into trouble.

@piconti
Member

piconti commented Jan 16, 2025

Yes, but if we delete the canonical data on S3, then what's currently in the manifest won't match anymore, and currently manifests are only additive.
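
To make the mismatch concrete, here is a minimal sketch, assuming a manifest can be reduced to a set of object keys (the actual manifest format is not described in this thread). The function `diff_manifest` and the key names are hypothetical and only serve the illustration.

```python
def diff_manifest(manifest_keys, storage_keys):
    """Compare keys recorded in a manifest with keys actually present in storage."""
    manifest, storage = set(manifest_keys), set(storage_keys)
    return {
        "stale": sorted(manifest - storage),      # listed in the manifest, no longer stored
        "untracked": sorted(storage - manifest),  # stored, but not listed in the manifest
    }


# Hypothetical keys: the manifest still lists GDL 1992 after it was deleted from storage.
manifest_keys = ["GDL-1991", "GDL-1992", "JDG-1992"]
storage_keys = ["GDL-1991", "JDG-1992"]
report = diff_manifest(manifest_keys, storage_keys)
```

With an additive-only manifest, the `stale` entries could never be cleared, which is the inconsistency at stake here.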
