Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

f16<->double #988

Open
gambiteer opened this issue Jun 1, 2024 · 8 comments
Open

f16<->double #988

gambiteer opened this issue Jun 1, 2024 · 8 comments

Comments

@gambiteer
Copy link
Contributor

Your code to convert between f16 and f64 is a bit opaque to me:

double sexp_half_to_double(unsigned short x) {
  unsigned int e = (x&0x7C00)>>10,
    m = (x&0x03FF)<<13,
    v = float_as_int((float)m)>>23;
  return int_as_float((x&0x8000)<<16 | (e!=0)*((e+112)<<23|m) | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000)));
}

unsigned short sexp_double_to_half(double x) {
  unsigned int b = float_as_int(x)+0x00001000,
    e = (b&0x7F800000)>>23,
    m = b&0x007FFFFF;
  return (b&0x80000000)>>16 | (e>112)*((((e-112)<<10)&0x7C00)|m>>13) | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1) | (e>143)*0x7FFF;
}

It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps).

I have a test code for my f16->double and double->f16, where the f16 representation is an unsigned 16-bit int.
So two questions:

  1. Can you give a reference for your code?
  2. Are the C routines sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set?

Thanks.

Brad

@ashinn
Copy link
Owner

ashinn commented Jun 1, 2024

Source: https://arxiv.org/abs/2112.08926

It was actually a place holder until I could derive something myself but I may just leave it as is.

@ashinn
Copy link
Owner

ashinn commented Jun 1, 2024

There is also some interesting discussion in: https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note one answer is just to convert the 3 float components separately, capping the exponent and rounding/truncating the mantissa depending on direction, then recombining the components:

Half to float:
float f = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);

Float to half:
uint32_t x = *((uint32_t*)&f);
uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);

which is in the spirit as the above implementation but a little too simple, not handling denormalized values or infinities, etc.

@gambiteer
Copy link
Contributor Author

Thanks. I used Gambit's C interface to test the code.

It seems that the code always rounds away from zero for doubles exactly between two representable numbers when converting to f16 instead of rounding to even.

It also gets the encoding of infinities and NaNs wrong, it should have

> (double->f16 +inf.0)
31744
> (double->f16 +nan.0)
32767

(the second coding isn't unique, but the mantissa needs to be nonzero) while the code returns

> (double->f16 +inf.0)
32767
> (double->f16 +nan.0)
32768
> (double->f16 -0.0)
32768
> (f16->double 32768)
-0.

32767 is the encoding for +nan.0 and 32768 is the encoding for -0.0.

@gambiteer
Copy link
Contributor Author

I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt .
It contains three lists.
The first is a list of doubles to be converted to f16 codes.
The second is a list of the codes associated with those doubles.
And the third is a list of reconstructed doubles from those codes.
If you negate all the data in the first list, the codes should just increase by 32768 (except perhaps for +nan.0).

I suspect that because of the implicit double->float conversion in sexp_double_to_half, the code might be affected by double rounding, too.

@ashinn
Copy link
Owner

ashinn commented Jun 1, 2024

The rounding I'm not too concerned about, but I'll fix the nan handling.

@gambiteer
Copy link
Contributor Author

OK, here are some examples where the finiteness of your results and my results differ.
In each row, the last of the three numbers is the input, the first is your result (double->f16->double), the second is my result.
65520.0 should overflow to +inf.0 in this format. The second group has the sign changed withflcopysign from SRFI 144 (so that doesn't make a difference in the print output of +nan.0).

finiteness_difference 131008.0 +inf.0 +inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference 65536.0 +inf.0 65520.0

finiteness_difference -131008.0 -inf.0 -inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference -65536.0 -inf.0 -65520.0

@ashinn
Copy link
Owner

ashinn commented Jun 2, 2024

Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:

$ chibi-scheme -msrfi.160.mini -p'(f16vector-ref #f16(+inf.0) 0)'
131008.0

ashinn added a commit that referenced this issue Jun 2, 2024
@gambiteer
Copy link
Contributor Author

I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem is in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface.

Srfi 231’s sample implementation has functions for coding and decoding (as parameterized Scheme macros), you might find them of interest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants