Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 147, 154 - Improve citation object hashing behavior #155

Merged
merged 11 commits into from
Sep 22, 2023

Conversation

mattdahl
Copy link
Contributor

@mattdahl mattdahl commented Jul 15, 2023

This PR does three things:

  1. Citations with normalized/corrected reporters (e.g., 1 U.S. 1 versus 1 U. S. 1) are now treated as equal (FullCaseCitation.group have badly formatted reporter  #147)
  2. Citations with nominative reporters (e.g., 5 U.S. 137 versus 5 U.S. (1 Cranch) 137 are now treated as equal (Citations with and without nominative reporters are not considered equal #154)
  3. Citation hashes are now reproducible/deterministic across runs (FullCaseCitation.group have badly formatted reporter  #147 (comment))

In terms of implementation, here's what's changed:

  • The old comparison_hash() method that a few classes had has been removed entirely. This was originally intended to essentially be a private method for Resource objects to call, but the name was confusing and could have been misinterpreted by a causal user as a public method that should be used for general comparisons. Going forward, anyone should simply use the built-in hash() method for everything, which is the most natural approach anyway.
  • CitationBase classes and their subclasses now all use the @dataclass(eq=False, unsafe_hash=False) decorator. Previously both eq and unsafe_hash were set to True. As the keyword argument implies, this was somewhat... unsafe. Or at least confusing. See here for a detailed discussion explaining how dataclass objects automatically try to create hash functions when these arguments are True. I have changed this so none of this implicit stuff is happening anymore; instead, we explicitly define __eq__() and __hash__() ourselves on the CitationBase class, which creates a clear record of how the hashing is actually happening. (These methods are then inherited by the citation subclasses, though ResourceCitation and CaseCitation classes override the __hash__() method in order to maintain conformity with existing behavior for those special cases.
  • To create persistent hashes, I use hashlib.sha256. However, the built-in hash() function will only output hashes that are 32 bits long, so I truncate the sha256 hash using c_int32() from the ctypes standard library. By doing this truncation ourselves, we ensure that calling hash(obj) and obj.__hash__() both return the same hash (for a given object).
  • This behavior was not previously well-defined, but I have made it so IdCitation and UnknownCitation objects are sui generis. Since we don't really know anything about what these objects point to, I think it would be dangerous to ever hash two of them the same.

If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion, this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.

Feedback welcome!

@mattdahl
Copy link
Contributor Author

(It looks like the benchmark action is not working again.)

@mlissner
Copy link
Member

@flooie can you please help get this reviewed? @overmode, I think you might want to take a look too, even if you only glance, since I think it's fixing several bugs you reported.

@overmode
Copy link
Contributor

Awesome, thanks for implementing the changes and notifying me.

It has been a while since last time I took a look at eyecite's source code but the PR looks good to me. You might want to document, for each class, what keys are in the groups, as it would make the comparison more transparent.

My colleague @khakhlyuk will also take a look, he is currently working with eyecite.

@khakhlyuk
Copy link

khakhlyuk commented Jul 18, 2023

great work, thank you for implementing this, very useful and very much needed!

The code looks good to me.

The only issue I see is using 32bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32bit hash and a dataset of this size hash collisions are inevitable. Even for 100k citations, the probability of a single hash collision is 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html

Is python's default hash() output of 32bits a limitation? I have several thoughts on this.

  1. hash() returns 64-bit ints on 64-bit platforms anyway
  2. i have tried using larger ints as output of a hash function and it works. I have experimented with c_int64 and even using the 256-bit hash directly - int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big"). I have used Citation objects with 64-bit and 256-bit hash as keys for dict and set, everything work fine.
  3. afaik when creating sets or dicts, python will apply modulo to the hash value. That's why it does not make a very big difference if the hash value is 32bit, 64bit or even 256 bit.

Is there another reason to limit output to a 32-bit int? If not, it would be really nice to change the hash to 64bit ints in my opinion.

@mattdahl
Copy link
Contributor Author

3. afaik when creating sets or dicts, python will apply modulo to the hash value. That's why it does not make a very big difference if the hash value is 32bit, 64bit or even 256 bit.

Ah, this is interesting! I did not realize, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34

In that case, I will remove the truncation logic and let hash() do its own thing. The only problem I foresee with this approach is that the hashes will not be the same across 32 and 64 bit machines (because of the way the modulo is implemented), but I don't think that's a serious concern.

Thanks for pointing this out.

@mattdahl
Copy link
Contributor Author

mattdahl commented Jul 18, 2023

All right, 0a30559 removes the truncation, exploiting the fact that calling hash() twice on an arbitrarily large integer (e.g., hash(hash(123))) guarantees reproducibility (because of that modulo trick). Let me know what you think! (This means that the hashes should be 64-bit on 64-bit machines.)

@mattdahl
Copy link
Contributor Author

@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon.

@mlissner
Copy link
Member

Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all!

@khakhlyuk
Copy link

i like the trick with applying an additional hash(). Good solution!
LGTM

@mattdahl
Copy link
Contributor Author

I added some documentation spelling out the expected content of self.groups for the different citation classes, as suggested by @overmode . As I was doing this, however, I realized that the volume key is not guaranteed to exist for CaseCitation objects. (The reporter and page keys are, at least they should be, because of a test we have in reporters_db. This is also reflected in the new documentation.) Thus, my previous code would crash for e.g. certain tax court citations. I've fixed that bug now too.

@mattdahl
Copy link
Contributor Author

Comrade @flooie, just wanted to bump this if you have a chance to review and merge.

@flooie flooie merged commit 65832f9 into freelawproject:main Sep 22, 2023
@mattdahl mattdahl mentioned this pull request Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants