f16<->double #988
Comments
Source: https://arxiv.org/abs/2112.08926
It was actually a placeholder until I could derive something myself, but I may just leave it as is.
There is also some interesting discussion in https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note that one answer is simply to convert the 3 float components separately, capping the exponent and rounding/truncating the mantissa depending on direction, then recombining the components. That is in the same spirit as the above implementation but a little too simple: it does not handle denormalized values, infinities, etc.
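For illustration, the component-wise approach described in that answer might look roughly like the sketch below. This is not the code from the Stack Overflow answer or from this project; the function names `naive_float_to_half` and `naive_half_to_float` are made up here, and the sketch assumes IEEE 754 binary32 `float` and truncation of the mantissa.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: float -> f16 by converting the three fields separately.
   Deliberately too simple: zeros, subnormals, overflow and NaNs are not
   handled correctly (e.g. a large finite input gets exponent 31 with a
   nonzero mantissa, which is a NaN encoding). */
static uint16_t naive_float_to_half(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                 /* read the raw binary32 bits */
    uint32_t sign = (bits >> 31) & 0x1;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15;  /* rebias 8 -> 5 bits */
    uint32_t man  = (bits >> 13) & 0x3FF;           /* keep the top 10 mantissa bits */
    if (exp < 0)  exp = 0;                          /* cap the exponent */
    if (exp > 31) exp = 31;
    return (uint16_t)((sign << 15) | ((uint32_t)exp << 10) | man);
}

/* Hypothetical sketch of the reverse direction, with the same limitations:
   code 0x0000 comes back as 2^-15 rather than zero. */
static float naive_half_to_float(uint16_t h) {
    uint32_t sign = (h >> 15) & 0x1;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits = (sign << 31) | ((exp - 15 + 127) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Ordinary normalized values survive the round trip, which is why the approach looks attractive; the edge cases are where it breaks down.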
Thanks. I used Gambit's C interface to test the code. It seems that, when converting to f16, the code always rounds away from zero for doubles exactly halfway between two representable numbers, instead of rounding to even. It also gets the encoding of infinities and NaNs wrong; it should have
(the second coding isn't unique, but the mantissa needs to be nonzero) while the code returns
32767 is the encoding for
I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt . I suspect that because of the implicit double->float conversion in
The rounding I'm not too concerned about, but I'll fix the NaN handling.
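To make the two points above concrete, here is a minimal sketch of a direct double -> f16 conversion that rounds ties to even and uses the IEEE 754 binary16 encodings for infinity (0x7C00, exponent all ones and zero mantissa) and NaN (exponent all ones, nonzero mantissa). It is not the project's code; the name `double_to_f16_rne` is hypothetical, 0x7E00 is just one of many valid NaN encodings, and the sketch assumes `double` is IEEE 754 binary64.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: binary64 -> binary16, round-to-nearest, ties-to-even. */
static uint16_t double_to_f16_rne(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    uint16_t sign = (uint16_t)((bits >> 48) & 0x8000u);
    int      e    = (int)((bits >> 52) & 0x7FF);   /* biased binary64 exponent */
    uint64_t m    = bits & 0xFFFFFFFFFFFFFULL;     /* 52-bit fraction */

    if (e == 0x7FF)                                /* infinity or NaN */
        return (uint16_t)(sign | (isnan(d) ? 0x7E00u : 0x7C00u));

    e = e - 1023 + 15;                             /* rebias for binary16 */
    if (e >= 31)                                   /* too large: overflow to infinity */
        return (uint16_t)(sign | 0x7C00u);

    int shift = 42;                                /* 52 - 10 fraction bits are dropped */
    if (e <= 0) {                                  /* result is subnormal or zero */
        if (e < -10) return sign;                  /* below half the smallest subnormal */
        m |= 1ULL << 52;                           /* make the implicit leading 1 explicit */
        shift += 1 - e;
        e = 0;
    }

    uint64_t q    = m >> shift;                    /* bits that are kept */
    uint64_t rem  = m & ((1ULL << shift) - 1);     /* bits that are dropped */
    uint64_t half = 1ULL << (shift - 1);           /* exactly halfway */
    uint16_t h = (uint16_t)(((unsigned)e << 10) + (unsigned)q);
    if (rem > half || (rem == half && (q & 1)))
        h++;                                       /* a carry may bump the exponent, up to 0x7C00 */
    return (uint16_t)(sign | h);
}
```

With this sketch, 1 + 2^-11 (halfway between codes 0x3C00 and 0x3C01) goes to 0x3C00, while 1 + 3·2^-11 (halfway between 0x3C01 and 0x3C02) goes to 0x3C02. Working from the double directly also avoids a second rounding step through `float`.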
OK, here are some examples where the finiteness of your results and my results differ.
Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:
I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem was in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface. SRFI 231’s sample implementation has functions for coding and decoding (as parameterized Scheme macros); you might find them of interest.
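One way to test decoding on its own, sketched below, is to compare a decoder against a reference built straight from the binary16 field definitions, over all 65536 codes. The names `ref_half_to_double`, `same_double` and `count_decode_mismatches` are hypothetical and belong to neither Chibi, Gambit, nor SRFI 231; the decoder under test is passed as a function pointer whose signature is an assumption.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical reference decoder: sign, 5-bit exponent, 10-bit mantissa. */
static double ref_half_to_double(uint16_t h) {
    int sign = (h >> 15) & 1;
    int exp  = (h >> 10) & 0x1F;
    int man  = h & 0x3FF;
    double v;
    if (exp == 31)     v = man ? NAN : INFINITY;               /* NaN or infinity */
    else if (exp == 0) v = ldexp((double)man, -24);            /* zero or subnormal */
    else               v = ldexp((double)(man + 1024), exp - 25); /* (1 + man/1024) * 2^(exp-15) */
    return sign ? -v : v;
}

/* Equality that treats any two NaNs as equal and distinguishes +0 from -0. */
static int same_double(double a, double b) {
    if (isnan(a) || isnan(b)) return isnan(a) && isnan(b);
    return a == b && !signbit(a) == !signbit(b);
}

/* Count the codes on which the decoder under test disagrees with the reference. */
static int count_decode_mismatches(double (*decode)(uint16_t)) {
    int bad = 0;
    for (uint32_t h = 0; h <= 0xFFFF; h++)
        if (!same_double(decode((uint16_t)h), ref_half_to_double((uint16_t)h)))
            bad++;
    return bad;
}
```

Once a decoder is trusted, an encoder can be checked separately in the same spirit: feed it `ref_half_to_double(h)` for every non-NaN code `h` and expect `h` back, so a failure points at one direction only.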
Your code to convert between f16 and f64 is a bit opaque to me.
It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps).
I have test code for my f16->double and double->f16, where the f16 representation is an unsigned 16-bit int.
So two questions:
Are sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set?
Thanks.
Brad