Parsing error for citations with defendant 'Thompson' #174

Open
ERosendo opened this issue Mar 28, 2024 · 5 comments · May be fixed by #190
Comments

@ERosendo
Contributor

ERosendo commented Mar 28, 2024

In issue #3924, we identified a bug in Eyecite's parsing method when the defendant's last name is 'Thompson'.

For example, for the citation 'Shapiro v. Thompson, 394 U. S. 618':

  • Expected output: volume: 394, reporter: 'U.S.', page: '618'
  • Actual output: volume: None, reporter: 'Thompson', page: '394'

Other examples of inputs that are incorrectly parsed are: Adams v. Thompson, 560 F. Supp. 894 and Mozena v. Thompson, 44 A.2d 276.

I've been using the first example to debug this issue and noticed that Eyecite identifies two tokens within the input string: "Thompson's Unreported Cases (TN)" and "United States Supreme Court Reports". The problem arises because these tokens overlap (both include "394") and Eyecite's tokenize method prioritizes the rightmost token when it encounters overlaps, which leads to this result.
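To make the overlap concrete, here is a minimal sketch in plain Python. It does not call Eyecite; the two fragments and the `span_of` and `overlap` helpers are assumptions chosen only to illustrate how both candidate matches claim the characters "394".

```python
from typing import Optional, Tuple

text = "Shapiro v. Thompson, 394 U. S. 618"

Span = Tuple[int, int]


def span_of(fragment: str) -> Span:
    """Return the (start, end) character offsets of fragment in text."""
    start = text.index(fragment)
    return start, start + len(fragment)


def overlap(a: Span, b: Span) -> Optional[Span]:
    """Return the shared character range of two spans, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


# Hypothetical candidate matches (assumed for illustration, not Eyecite output).
nominative = span_of("Thompson, 394")  # read as reporter "Thompson", page 394
full = span_of("394 U. S. 618")        # read as volume 394, reporter U.S., page 618

shared = overlap(nominative, full)
print(nominative, full, repr(text[shared[0]:shared[1]]))
# (11, 24) (21, 34) '394'
```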

@ERosendo changed the title from "parsing error for citations with defendant 'Thompson'" to "Parsing error for citations with defendant 'Thompson'" Mar 28, 2024
@mlissner
Member

Any idea how easy this is to solve so that it identifies each?

@mlissner
Member

mlissner commented Apr 1, 2024

Per discussion today, seems to be happening when citations appear to overlap. The simple solution here is to find both citations that overlap and then filter out the one that's incomplete.

@mlissner mlissner moved this to Main Backlog in @erosendo's backlog Apr 1, 2024
@ERosendo ERosendo moved this from Main Backlog to Bots Backlog in @erosendo's backlog Apr 15, 2024
@flooie flooie moved this to General Backlog in Case Law Sprint Nov 19, 2024
@flooie flooie moved this from General Backlog to To Do in Case Law Sprint Dec 17, 2024
@quevon24
Member

quevon24 commented Jan 7, 2025

The problem here is that we have Thompson as a nominative reporter (optional volume), which is why it is detected as a citation.

We could detect the overlap and see which one is incomplete, but according to the reporters-db Thompson regex, 'Thompson, 394' is a valid citation. So we get two valid citations: 'Thompson, 394' and '394 U. S. 618'.

This can be replicated with other nominative reporters like:

  • Foo v. Cooke (Tennessee Reports, Cooke)
  • Foo v. Holmes (Holmes Circuit Court Reports (US))
  • Foo v. McCahon (Kansas Reports, McCahon)
  • Foo v. Olcott (Olcott's Admiralty Reports)
  • Foo v. Taney (Taney's United States (US) Circuit Court Reports)
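A rough way to see why any of these names triggers the same behavior: the sketch below uses a simplified stand-in pattern with an optional volume. It is an assumption for illustration only, not the actual reporters-db regex, and `NOMINATIVE_RE` and `nominative_match` are hypothetical names.

```python
import re

# Reporter names taken from this thread.
NOMINATIVE_NAMES = ["Thompson", "Cooke", "Holmes", "McCahon", "Olcott", "Taney"]

names = "|".join(re.escape(n) for n in NOMINATIVE_NAMES)

# Simplified stand-in (NOT the reporters-db regex):
# an optional volume, the reporter name, then a page number.
NOMINATIVE_RE = re.compile(
    rf"(?:(?P<volume>\d+) )?(?P<reporter>{names}),? (?P<page>\d+)"
)


def nominative_match(text):
    """Return the named groups of the first nominative-style match, if any."""
    m = NOMINATIVE_RE.search(text)
    return m.groupdict() if m else None


print(nominative_match("Shapiro v. Thompson, 394 U. S. 618"))
# {'volume': None, 'reporter': 'Thompson', 'page': '394'}

for name in NOMINATIVE_NAMES:
    # Every defendant name doubles as a reporter, so each one matches.
    assert nominative_match(f"Foo v. {name}, 394 U. S. 618") is not None
```

Because the volume is optional, the defendant's name followed by the real citation's volume already satisfies the pattern, which mirrors the "volume: None, reporter: 'Thompson', page: '394'" output reported above.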

@mlissner
Member

mlissner commented Jan 7, 2025

Seems like some post processing could detect and remedy this. The Thompson one may be technically complete, but it's the same data. So maybe: If there's overlap and the data overlaps too, then....

@flooie
Contributor

flooie commented Jan 7, 2025

@mlissner I like your solution: if there is an overlapping span, we can check whether one citation is complete and the other is nominative, and choose the complete citation.
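A minimal sketch of that post-processing rule, independent of Eyecite's real classes. The `Candidate` dataclass, its fields, and `drop_overlapping_nominatives` are hypothetical, and "complete" is assumed here to mean "has a volume".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical stand-in for a parsed citation (not an Eyecite class)."""
    span: Tuple[int, int]          # character offsets in the source text
    volume: Optional[str]
    reporter: str
    page: str
    nominative: bool = False       # True if matched by a nominative reporter


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two candidates share at least one character."""
    return a.span[0] < b.span[1] and b.span[0] < a.span[1]


def drop_overlapping_nominatives(cands: List[Candidate]) -> List[Candidate]:
    """Drop volume-less nominative candidates that overlap a complete citation."""
    kept = []
    for c in cands:
        shadowed = c.nominative and c.volume is None and any(
            overlaps(c, o) and o.volume is not None and not o.nominative
            for o in cands
            if o is not c
        )
        if not shadowed:
            kept.append(c)
    return kept


# Offsets for "Shapiro v. Thompson, 394 U. S. 618"
thompson = Candidate((11, 24), None, "Thompson", "394", nominative=True)
us = Candidate((21, 34), "394", "U.S.", "618")

print([c.reporter for c in drop_overlapping_nominatives([thompson, us])])
# ['U.S.']
```

A nominative candidate that does not overlap anything is left alone, so genuine nominative citations should still come through. The rule could be tightened further by also requiring that the nominative page equals the other citation's volume, along the lines of the "data overlaps too" idea above.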
