f16<->double #988

gambiteer · 2024-06-01T04:20:05Z

Your code to convert between f16 and f64 is a bit opaque to me:

double sexp_half_to_double(unsigned short x) {
  unsigned int e = (x&0x7C00)>>10,
    m = (x&0x03FF)<<13,
    v = float_as_int((float)m)>>23;
  return int_as_float((x&0x8000)<<16 | (e!=0)*((e+112)<<23|m) | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000)));
}

unsigned short sexp_double_to_half(double x) {
  unsigned int b = float_as_int(x)+0x00001000,
    e = (b&0x7F800000)>>23,
    m = b&0x007FFFFF;
  return (b&0x80000000)>>16 | (e>112)*((((e-112)<<10)&0x7C00)|m>>13) | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1) | (e>143)*0x7FFF;
}

It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps).

I have a test code for my f16->double and double->f16, where the f16 representation is an unsigned 16-bit int.
So two questions:

Can you give a reference for your code?
Are the C routines sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set?

Thanks.

Brad

The text was updated successfully, but these errors were encountered:

ashinn · 2024-06-01T13:01:20Z

Source: https://arxiv.org/abs/2112.08926

It was actually a place holder until I could derive something myself but I may just leave it as is.

ashinn · 2024-06-01T13:09:21Z

There is also some interesting discussion in: https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note one answer is just to convert the 3 float components separately, capping the exponent and rounding/truncating the mantissa depending on direction, then recombining the components:

Half to float:
float f = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);

Float to half:
uint32_t x = *((uint32_t*)&f);
uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);

which is in the spirit as the above implementation but a little too simple, not handling denormalized values or infinities, etc.

gambiteer · 2024-06-01T15:27:59Z

Thanks. I used Gambit's C interface to test the code.

It seems that the code always rounds away from zero for doubles exactly between two representable numbers when converting to f16 instead of rounding to even.

It also gets the encoding of infinities and NaNs wrong, it should have

> (double->f16 +inf.0)
31744
> (double->f16 +nan.0)
32767

(the second coding isn't unique, but the mantissa needs to be nonzero) while the code returns

> (double->f16 +inf.0)
32767
> (double->f16 +nan.0)
32768
> (double->f16 -0.0)
32768
> (f16->double 32768)
-0.

32767 is the encoding for +nan.0 and 32768 is the encoding for -0.0.

gambiteer · 2024-06-01T17:45:14Z

I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt .
It contains three lists.
The first is a list of doubles to be converted to f16 codes.
The second is a list of the codes associated with those doubles.
And the third is a list of reconstructed doubles from those codes.
If you negate all the data in the first list, the codes should just increase by 32768 (except perhaps for +nan.0).

I suspect that because of the implicit double->float conversion in sexp_double_to_half, the code might be affected by double rounding, too.

ashinn · 2024-06-01T22:27:00Z

The rounding I'm not too concerned about, but I'll fix the nan handling.

gambiteer · 2024-06-02T01:05:25Z

OK, here are some examples where the finiteness of your results and my results differ.
In each row, the last of the three numbers is the input, the first is your result (double->f16->double), the second is my result.
65520.0 should overflow to +inf.0 in this format. The second group has the sign changed withflcopysign from SRFI 144 (so that doesn't make a difference in the print output of +nan.0).

finiteness_difference 131008.0 +inf.0 +inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference 65536.0 +inf.0 65520.0

finiteness_difference -131008.0 -inf.0 -inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference -65536.0 -inf.0 -65520.0

ashinn · 2024-06-02T13:16:37Z

Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:

$ chibi-scheme -msrfi.160.mini -p'(f16vector-ref #f16(+inf.0) 0)'
131008.0

Issue #988.

gambiteer · 2024-06-02T14:46:18Z

I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem is in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface.

Srfi 231’s sample implementation has functions for coding and decoding (as parameterized Scheme macros), you might find them of interest.

ashinn added a commit that referenced this issue Jun 2, 2024

Tentatively manually encoding non-finite f16 values.

d4028f9

Issue #988.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

f16<->double #988

f16<->double #988

gambiteer commented Jun 1, 2024

ashinn commented Jun 1, 2024 •

edited

Loading

ashinn commented Jun 1, 2024

gambiteer commented Jun 1, 2024

gambiteer commented Jun 1, 2024

ashinn commented Jun 1, 2024

gambiteer commented Jun 2, 2024

ashinn commented Jun 2, 2024

gambiteer commented Jun 2, 2024

f16<->double #988

f16<->double #988

Comments

gambiteer commented Jun 1, 2024

ashinn commented Jun 1, 2024 • edited Loading

ashinn commented Jun 1, 2024

gambiteer commented Jun 1, 2024

gambiteer commented Jun 1, 2024

ashinn commented Jun 1, 2024

gambiteer commented Jun 2, 2024

ashinn commented Jun 2, 2024

gambiteer commented Jun 2, 2024

ashinn commented Jun 1, 2024 •

edited

Loading