Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad pub_date_sort for some really old objects from SDR #1319

Open
cbeer opened this issue Feb 5, 2024 · 10 comments · May be fixed by sul-dlss/stanford-mods#148
Open

Bad pub_date_sort for some really old objects from SDR #1319

cbeer opened this issue Feb 5, 2024 · 10 comments · May be fixed by sul-dlss/stanford-mods#148
Assignees
Labels
bug Something isn't working

Comments

@cbeer
Copy link
Member

cbeer commented Feb 5, 2024

E.g. https://searchworks.stanford.edu/view/qb122dq4313.json has a pub_date_sort of 6000 and ends up sorting wrong.

@cbeer cbeer added the bug Something isn't working label Feb 5, 2024
@jcoyne
Copy link
Contributor

jcoyne commented Feb 5, 2024

@cbeer you added a comment here indicating that pub_date_sort is deprecated. Do we still need it?

# deprecated pub_date_sort - use pub_year_isi; pub_date_sort is a string and requires weirdness for bc dates
# can remove after pub_year_isi is populated for all indexing data (i.e. solrmarc, crez) and app code is changed
to_field 'pub_date_sort', stanford_mods(:pub_year_sort_str)

It looks like pub_year_isi is currently only populated for SDR items: http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_year_isi:*&q.op=OR&indent=true&fl=id. Compare that with the number that have pub_date_sort http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_date_sort:*&q.op=OR&indent=true&fl=id

Can we copy this code https://github.com/sul-dlss/searchworks_traject_indexer/blob/cdad30e01b4e44880af5c68923464adad8a74e2e/lib/traject/config/folio_config.rb#L831C11-L831C31 to a pub_year_isi and wait for that to index and then use pub_year_isi in Searchworks?

@cbeer
Copy link
Member Author

cbeer commented Feb 5, 2024

Well, we're very much still using it:
https://github.com/sul-dlss/SearchWorks/blob/main/app/controllers/catalog_controller.rb#L460-L461

My vague memory is pub_year_isi is all well and good for MODS records, but ambiguity in the MARC makes it harder to do there.

@dnoneill
Copy link
Contributor

This is happening because what is being returned from https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L47-L49 the stanford mods method is a Stanford::Mods::Imprint::DateRange. And apparently sorting that class returns a 6000. Super odd but okay. The quickest fix would be to update the pub_year_sort_str

to_field 'pub_date_sort', stanford_mods(:pub_year_sort_str)
to pub_year_int not sure if there are implications of this but it looks like it is the exact same method except it makes sure there is actually a date and if there is a daterange it will pull the data out of it. https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L19-L38.

@dnoneill dnoneill self-assigned this Feb 12, 2024
@cbeer
Copy link
Member Author

cbeer commented Feb 12, 2024

I think the problem with pub_year_int is sorting things with approximate years to the "right" place, e.g. I think we want "19--" to sort after anything with a known year 1000 - 1999.

@dnoneill
Copy link
Contributor

@cbeer are you saying we want pub_date_sort to return a string and pub_year_int is always going to return an int and it won't allow for something like 195x? I am not sure I understand what specifically what in the code for pub_year_int is going to cause a problem. From what I can tell it will it checks to see if it is a daterange. If it is it will get the start or stop date. From there it will it will get the date value of the date from earliest_preferred_date (which is what pub_year_sort_str) is getting. From there is checks that the date is a value, then it will check if the date is an interval if it is it will get the from.year value, otherwise it will get the year value. So I guess I am wondering what part of that will cause problems? Because at the very least we need to integrate some of that into the pub_year_sort_str function.

@cbeer
Copy link
Member Author

cbeer commented Feb 12, 2024

Check out the pub_date_sort we use for MARC records, and in particular the way we handle some of the non-numeric data:

# hyphens sort before 0, so the lexical sorting will be correct. I think.
year ||= if f008_bytes7to10 =~ (/\d\d[u-][u-]/) && (record['008'].value[7..8] <= Time.now.year.to_s[0..1])
"#{record['008'].value[7..8]}--"
end
# colons sort after 9, so the lexical sorting will be correct. I think.
# NOTE: the solrmarc code has this comment, and yet still uses hyphens below; maybe a bug?
year ||= if f008_bytes11to14 =~ (/\d\d[u-][u-]/) && (record['008'].value[11..12] <= Time.now.year.to_s[0..1])
"#{record['008'].value[11..12]}--"
end

I'm not sure how we handle that kind of requirement with a pub_year_int?

@dnoneill
Copy link
Contributor

dnoneill commented Feb 12, 2024

from what I can tell we aren't doing that at all with pub_year_sort_str it just looks at a bunch for mods fields (dateIssued, dateCreated, dateCaptured) and then returns the smallest date, but it is making an assumption that all those dates are a DateValue field.

https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L47-L49

@cbeer
Copy link
Member Author

cbeer commented Feb 13, 2024

Sure, but the MARC and MODS data are both indexing into the same field so we can give a consistent sort across all records. I'm not sure we can sort by pub_year_int for the MODS stuff and pub_date_sort for the MARC stuff and have things interfile and also continue to sort MARC records according to those rules for hyphens and colons.

@dnoneill
Copy link
Contributor

Do you have time for a huddle on this latter? I get we can't use pub_year_int for this but I would like some context around which part we can't use so I an update the pub_date_sort method so it works correctly.

@dnoneill dnoneill linked a pull request Feb 13, 2024 that will close this issue
@dnoneill
Copy link
Contributor

Okay I thought about it. The only other way I could think of was to allow ints but then fix what we are getting from the MARC records so that they are numbers. i.e. if we got 19xx we would convert to 1900 but marc is so variable that I think that would be way more messy and not provide us with a better result. I have updated the BCE conversion to -10000 and I change *-1 to abs just because I think its neater.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants