[Le Temps | rebuilt consolidated] Remove duplicate issues across GDL and JDG 1992-1998 #142
Comments
GDL and JDG data will not be re-ingested for the March release, so if I understand correctly, you suggest deleting the GDL data for this period (1992-1998)?
I overlooked the fact that it won't be re-ingested. Yes, I believe the GDL issues from 1991-09-01 to 1998 should be excluded from the rebuild. This will prevent these items from being parsed twice in downstream processes and eliminate the need for filter patches during Solr ingestion.
Noted. I can change the years listed in the roadmap so that I don't compute the rebuilt data for the years we don't want.
I just realized that this could be challenging to handle with manifests as they currently work (I created an issue about it).
I would just remove the canonical data for those years from the canonical folder. We could still keep it somewhere else. I would rather not patch manifests. For me, the manifest should follow the rule "what you see on S3 is what you get". Otherwise we can easily get into trouble.
Yes, but if we delete the canonical data on S3, then what's currently in the manifest won't match anymore, and manifests are currently only additive.
The contents of GDL and JDG are similar from 1991 until 1998, with the segmentation of JDG being better than GDL for this period.
When producing the canonical and rebuilt data, the GDL issues from 1992 (inclusive) through 1998 could thus be ignored.
The merger of the two newspapers occurs sometime in 1991; from what I found, on 02 Sep 1991 (cf. Slack message, but it would be good to double-check).
Currently this exclusion is applied at Solr indexing time, but it would be better to apply it at the beginning of the pipeline.
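To illustrate what an exclusion at the start of the pipeline could look like, here is a minimal sketch. It is not existing project code: the names `EXCLUDED_RANGES`, `is_excluded` and `filter_issues` are hypothetical, issues are represented as plain `(newspaper, date)` pairs for simplicity, and the range boundaries (1992-01-01 to 1998-12-31) are an assumption that should be adjusted once the exact merger date is confirmed.

```python
from datetime import date

# Hypothetical exclusion table: newspaper ID -> (first excluded day, last excluded day).
# The boundaries are an assumption based on "1992 (inclusive) through 1998".
EXCLUDED_RANGES = {
    "GDL": (date(1992, 1, 1), date(1998, 12, 31)),
}


def is_excluded(newspaper: str, issue_date: date) -> bool:
    """Return True if this issue falls inside an excluded range for its newspaper."""
    excluded = EXCLUDED_RANGES.get(newspaper)
    if excluded is None:
        return False
    start, end = excluded
    return start <= issue_date <= end


def filter_issues(issues):
    """Keep only the (newspaper, date) pairs that are not excluded."""
    return [(np_id, d) for np_id, d in issues if not is_excluded(np_id, d)]


if __name__ == "__main__":
    sample = [
        ("GDL", date(1991, 12, 31)),  # kept: before the excluded range
        ("GDL", date(1995, 6, 15)),   # dropped: duplicate period
        ("JDG", date(1995, 6, 15)),   # kept: JDG is the retained title
    ]
    for np_id, d in filter_issues(sample):
        print(np_id, d.isoformat())
```

Applying such a filter before the canonical and rebuilt steps would make the Solr-side filter unnecessary, but it does not by itself resolve the manifest question raised in the comments.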