Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashing #41

Open
ltratt opened this issue Jun 7, 2020 · 11 comments
Open

Hashing #41

ltratt opened this issue Jun 7, 2020 · 11 comments

Comments

@ltratt
Copy link

ltratt commented Jun 7, 2020

Hashing in Java SOM seems to be broken at least for doubles:

hashcode = (
    run = (
        1.1 hashcode println.
        1.1 hashcode println.
    )
)

prints different numbers when I run it:

968514068
1360767589

i.e. this program prints:

hashcode = (
    run = (
        self cmp: 'a' with: 'a'.
        self cmp: 123 with: 123.
        self cmp: self with: self.
        self cmp: 1.1 with: 1.1.
    )

    cmp: x with: y = (
        ((x hashcode) = (y hashcode)) println.
    )
)
true
true
true
false

I'm reasonably sure it's supposed to be all trues?

I wondered if this is why Integer.som defines hashcode = (^self)? On a related note, Integer's hash function is terrible and I'd much rather provide a better one: would it be better for important types such as this to have the VM provide the hash (since then the hard work of providing a decent hash is outsourced to someone else). Can we maybe get rid of the hashcode = (^self) from Integer?

@smarr
Copy link
Member

smarr commented Jun 7, 2020

My interpretation is that you assume that the hashcode of an object is derived from its value.
While not unreasonable, I don't think anyone* ever promised that to be the case. (*SOM's implementation)

Double does not override hashcode, so unreasonable or not, Double has the same way to determine a hash code as any other object, likely based on object identity, memory location, etc.

So, no, by definition, it is not "broken".

Since you don't define what's better, or what's terrible, I have to disregard the issue as random ramblings of someone grumpy because the sun doesn't shine...

If you tell me what you would consider "better", we may have a way to constructively discuss the issue.

@ltratt
Copy link
Author

ltratt commented Jun 7, 2020

Grumpy? Guilty as charged!

But I was mostly going on this from Object.som:

    " If you override =, you MUST override hashcode as well.  The rule
      obj1 = obj2   =>  obj1 hashcode = obj2 hashcode
      must be valid for all objects, or Hashtable will not work"
    =  other    = ( ^self == other )

So Double's hashing is broken in the sense that it'll give incorrect results if you use it in a hashtable, where one might reasonably assume that such a primitive type would just "work".

I agree that for non-primitive types, it's unreasonable to assume anything very much about hashing.

@ltratt
Copy link
Author

ltratt commented Jun 7, 2020

I take your point about me not providing an alternative. So let me do that. It seems reasonable to me that SOM should define "sensible" hashes for primitive types (ints, doubles, strings); any other types (including insances) would come under the "no guarantees given about the hash at all" category.

@smarr
Copy link
Member

smarr commented Jun 7, 2020

Well, Double not overriding hashcode while overriding = is a very good argument, I wasn't aware off. So, yes, I agree with it being broken, and should be fixed.

Though, what's a better hash for integer?

@smarr
Copy link
Member

smarr commented Jun 7, 2020

Integer and String provide "content-based hash codes".

String uses at least on CSOM the following:
https://github.com/SOM-st/CSOM/blob/e0408285ba8d1fcf47607e2bb66de21432b27e58/src/misc/defs.h#L61-L67

SOM-java uses Java's String.hashCode(), which if I read the documentation correctly, is where CSOM got its hash code from: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#hashCode()

@ltratt
Copy link
Author

ltratt commented Jun 7, 2020

Gosh, there are lots of integer hashing algorithms, and I just use other people's! https://en.wikipedia.org/wiki/Hash_function#Hashing_integer_data_types has some thoughts on this. I've found FNV to be decent for ints http://www.isthe.com/chongo/tech/comp/fnv/index.html but I don't know enough to claim that it's the absolute best.

@smarr
Copy link
Member

smarr commented Jun 7, 2020

Ok, I guess this means something like FNV-1a:

hash = offset_basis
for each octet_of_data to be hashed
 hash = hash xor octet_of_data
 hash = hash * FNV_prime
return hash

Where offset_basis is:
32 bit offset_basis = 2166136261
64 bit offset_basis = 14695981039346656037

And FNV_prime:

32 bit FNV_prime = 224 + 28 + 0x93 = 16777619
64 bit FNV_prime = 240 + 28 + 0xb3 = 1099511628211

@ltratt
Copy link
Author

ltratt commented Jun 7, 2020

Personally I don't think a language needs to define what the hashing algorithm is, other than guaranteeing when equality and hashing are expected to give compatible answers. FNV is just, AFAIK, a decent algorithm for small data like ints.

@smarr
Copy link
Member

smarr commented Jun 7, 2020

Well, and then there's benchmarking. Do you want this to be another variable you have to account for?

I am not saying this should be spec'ed and tested. Though, if you want this to be changed in the core-lib, we'll need implementations. And based on what I read quickly, and what you said, the above seems a likely candidate.

On slightly related note: SOM implementations are allowed to replace core-lib methods with VM-level substitutes. So, you can replace Integer>>hashcode with a better hash right now.

@ltratt
Copy link
Author

ltratt commented Jun 7, 2020

Well, and then there's benchmarking. Do you want this to be another variable you have to account for?

I think this sort of non-determinism is endemic in modern software. For example, if you use a implementation language level hashmap (in Java or Rust or whatever) in your interpreter, you'll have this problem anyway. The only realistic solution we have to this at the moment, IMHO, is just to run things more often and get decent confidence intervals. I'd love to do better than that, but it's not easy.

On slightly related note: SOM implementations are allowed to replace core-lib methods with VM-level substitutes.

I'm a bit surprised at that, as this afternoon I discovered (but haven't accounted for!) the fact that SOM programs can tell which methods are user-level or primitives (TestSuite/ClassStructureTest.som). It also means that if a user alters such a method, the VM will silently discard their behaviour. That seems a bit un-SOMy?

@smarr
Copy link
Member

smarr commented Jun 7, 2020

SOM programs can tell which methods are user-level or primitives

This should be a representation of what the source code says.
That means, it should be base on using = primitive or = (...).

That seems a bit un-SOMy?

In an ideal world, you can cheat, as long as you don't get caught.

Elsewhere, don't remember where, there was a discussion on checking that the method in the core-lib is what one expects it to be.
So that code changes are not silently ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants