Skip to content

Latest commit

 

History

History
634 lines (554 loc) · 26.6 KB

BOSE.md

File metadata and controls

634 lines (554 loc) · 26.6 KB

Binary Octet-Stream Encoding (BOSE)

For more-efficient network transmission, we propose an octet-stream encoding which is isomorphic to JSON. As with ASCII text, there are multiple ways to encode the same abstract JSON value. All valid encodings produce equivalent JSON values, although some representation details are lost. Every valid JSON value can be represented in BOSE. Values may be round-tripped without semantic loss. Arbitrarily large values are fully supported. Decimal values can be represented exactly, without loss due to base-2 translation.

Example

Consider the following JSON Object:

{
    "space" : {
        "origin" : [ -40, -20 ],
        "extent" : [ 600, 460 ]
    },
    "shapes" : [
        {
            "origin" : [ 5, 3 ],
            "extent" : [ 21, 13 ]
        },
        {
            "origin" : [ 8, 5 ],
            "extent" : [ 13, 8 ]
        }
    ]
}

One possible BOSE representation would be:

0000:  07 d0 82 0a 85 73 70 61  63 65 05 a0 0b 86 6f 72  |.....space....or|
0010:  69 67 69 6e 06 83 82 58  6c 0b 86 65 78 74 65 6e  |igin...Xl..exten|
0020:  74 06 89 82 10 02 58 02  10 02 cc 01 0a 86 73 68  |t.....X.......sh|
0030:  61 70 65 73 04 9c 05 8c  09 00 04 82 85 83 09 01  |apes............|
0040:  04 82 95 8d 05 8c 09 00  04 82 88 85 09 01 04 82  |................|
0050:  8d 88                                             |..              |

A corresponding structure in "C" could be:

#include "bose.h"
octet_t buf[] = {
    object_n, n_80, n_2,
        utf8, n_5, 's', 'p', 'a', 'c', 'e',
        object, n_32,
            utf8_mem, n_6, 'o', 'r', 'i', 'g', 'i', 'n',
            array_n, n_3, n_2,
                n_m40,
                n_m20,
            utf8_mem, n_6, 'e', 'x', 't', 'e', 'n', 't',
            array_n, n_9, n_2,
                p_int_0, n_2, 600 & 0xFF, 600 >> 8,
                p_int_0, n_2, 460 & 0xFF, 460 >> 8,
        utf8, n_6, 's', 'h', 'a', 'p', 'e', 's',
        array, n_28,
            object, n_12,
                mem_ref, 0, array, n_2, n_5, n_3,
                mem_ref, 1, array, n_2, n_21, n_13,
            object, n_12,
                mem_ref, 0, array, n_2, n_8, n_5,
                mem_ref, 1, array, n_2, n_13, n_8,
};

Single-Octet Encodings

There are six types of abstract data values representable in JSON:

  • Null
  • Boolean
  • Number
  • String
  • Array
  • Object

A small set of special values are encoded directly in a single octet:

encoding hex value
2#00000000 00 false
2#00000001 01 true
2#00000010 02 []
2#00000011 03 {}
2#00001111 0F ""
2#10000000 80 0
2#11111111 FF null

Small integer values (from -64 through 126) are also encoded in a single octet:

encoding hex value
2#01nnnnnn 40..7F -64..-1
2#1nnnnnnn 80..FE 0..126

The encoding of 0 falls naturally within this range. Negative integers are represented in 2's-complement format (all-bits-set = -1).

Multi-Octet Encodings

Extended values (Number, String, Array, and Object) occupy more than one octet:

encoding hex value extension
2#00000100 04 [...] size::Number ::Value*
2#00000101 05 {...} size::Number (name::String ::Value)*
2#00000110 06 [ count ] size::Number count::Number ::Value*n
2#00000111 07 { count } size::Number count::Number (name::String ::Value)*n
2#00001000 08 Octet* size::Number bytes::Octet*
2#00001001 09 @ Memo# index::Octet
2#0000101m 0A..0B UTF-8 size::Number chars::Octet*
2#0000110m 0C..0D UTF-16 size::Number (hi::Octet lo::Octet)*
2#00001110 0E encoded size::Number name::String data::Octet*
2#0001sppp 10..1F Integer &pad size::Number int::Octet*
2#0010sppp 20..2F Decimal &pad size::Number exp::Number int::Octet*
2#0011sppp 30..3F Based &pad size::Number base::Number exp::Number int::Octet*

Extended values (except memo refs) contain a size indicating how many octets the value occupies after the size, or how many to skip if the value is ignored.

Number

An extended Number may be an Integer (2#0001sppp), Decimal (2#0010sppp), or Based/Rational (2#0011sppp). The s field is the sign (0 for positive, 1 for negative). The ppp field is the number of padding bits added to the MSB (0 through 7) to fill the final octet. Padding bits match the sign.

encoding hex value extension
2#00010ppp 10..17 +Integer &pad size::Number int::Octet*
2#00011ppp 18..1F -Integer &pad size::Number int::Octet*
2#00100ppp 20..27 +Decimal &pad size::Number exp::Number int::Octet*
2#00101ppp 28..2F -Decimal &pad size::Number exp::Number int::Octet*
2#00110ppp 30..37 +Based &pad size::Number base::Number exp::Number int::Octet*
2#00111ppp 38..3F -Based &pad size::Number base::Number exp::Number int::Octet*
2#01nnnnnn 40..7F -small [-64..-1] -
2#1nnnnnnn 80..FE +small [0..126] -

The octets of the int portion are stored LSB first, with the MSB padded as described above. Negative int values are represented in 2's-complement format (all-bits-set = -1). Decimal values include an exp field that scales the value by powers of 10. Based values include a base field that specifies the base, as well as the exp field. The fully general formula for a Number is (int × base ^ exp). The default base is 10. The default exp is 0. Note that exact Rational numbers can be represented using base as the denominator with an exp of -1.

Binary Data (recommendation)

There are two ways to encode raw binary data. Either as an arbitrarily large Integer, or as a String of octets. A bit-string of any length can be encoded as an Integer with 0 or 1 padding to fill out the final octet. A byte-blob of any length can be encoded as a String of raw octets, which are just Unicode code-points restricted to the range 0 through 255. Neither encoding requires any sort of quote/escape processing.

String

An extended String begins with 2#00001eem where ee indicates encoding, and m indicates memoization.

encoding hex value extension
2#00001000 08 Octet* size::Number bytes::Octet*
2#00001001 09 @ Memo# index::Octet
2#0000101m 0A..0B UTF-8 size::Number chars::Octet*
2#0000110m 0C..0D UTF-16 size::Number (hi::Octet lo::Octet)*
2#00001110 0E encoded size::Number name::String data::Octet*
2#00001111 0F "" -

Next is the size of the value in octets, as defined above. Unless this is a memoized string reference (2#00001001), in which case the next octet is an index into the memoization table. The memoization table is treated as a ring-buffer, starting at 0 for each top-level Value in a stream. Memoized strings are often used for Object property names. For UTF-8 and UTF-16 values, when the m-bit is 1 an entry is stored at the current memo index and the index is incremented, wrapping around from 2#11111111 back to 2#00000000. If eem is 2#110, the size is followed by a String that names the encoding. A decoder will reject an encoding it does not recognize. If ee is 2#10 the string value consists of octet-pairs, encoding 16-bit values MSB-first (per RFC 2781). A UTF-16 encoded string value may begin with a byte-order-mark (included in the size, of course, but not in the string value) to signal MSB-first (16#FEFF) or LSB-first (16#FFFE) ordering of octets.

Capability (recommendation)

In applications that require the transmission of capabilities, a String may be used to represent capabilities in transit. This requires some way to distinguish these capabilities from normal Strings. Our recommendation is to mark each String with a distinguishing prefix. For normal Strings, a byte-order-mark is a safe choice. For capability Strings, a data-link-escape (which is 2#00010000, or ^P in ASCII) can provide the desired semantics, interpreting the rest of the String as binary-octet data.

Extended Types (recommendation)

Application-specific extended types may be represented as Encoded Strings (2#00001110). The encoding is specified by an application-chosen String. In a given application, the Strings representing encodings are likely to be heavily used. Therefore, we recommend marking the first occurance with the memo bit and using the single-octet memo reference index for subsequent occurances. This technique may also be used to embed already-encoded values verbatim, as octet sequences.

Array

An extended Array may (2#00000110), or may not (2#00000100), specify an element count. However, a size in octets is always provided for non-empty Arrays.

encoding hex value extension
2#00000010 02 [] -
2#00000100 04 [...] size::Number ::Value*
2#00000110 06 [ count ] size::Number count::Number ::Value*n

The end of the array is reached when then specified number of octets have been consumed, which should correspond to decoding the matching count of elements (if specified). A decoder may reject a mismatch.

Object

An extended Object may (2#00000111), or may not (2#00000101), specify a property count. However, a size in octets is always provided for non-empty Objects.

encoding hex value extension
2#00000011 03 {} -
2#00000101 05 {...} size::Number (name::String ::Value)*
2#00000111 07 { count } size::Number count::Number (name::String ::Value)*n

Properties are encoded as a String (property name) followed by an encoded Value. Note that the property name strings may be memoized, reducing the octet-size of the object. The end of the object is reached when then specified number of octets have been consumed, which should correspond to decoding the matching count of properties (if specified). A decoder may reject a mismatch.

Encoding Matrix

The following table summarizes the meaning of the first octet in a Value:

Hi \ Lo 2#_000 2#_001 2#_010 2#_011 2#_100 2#_101 2#_110 2#_111
2#00000_ false true [] {} [...] {...} [ n ] { n }
2#00001_ octets @ memo# UTF-8 UTF-8* UTF-16 UTF-16* encoded ""
2#00010_ +int &0 +int &1 +int &2 +int &3 +int &4 +int &5 +int &6 +int &7
2#00011_ -int &0 -int &1 -int &2 -int &3 -int &4 -int &5 -int &6 -int &7
2#00100_ +dec &0 +dec &1 +dec &2 +dec &3 +dec &4 +dec &5 +dec &6 +dec &7
2#00101_ -dec &0 -dec &1 -dec &2 -dec &3 -dec &4 -dec &5 -dec &6 -dec &7
2#00110_ +base &0 +base &1 +base &2 +base &3 +base &4 +base &5 +base &6 +base &7
2#00111_ -base &0 -base &1 -base &2 -base &3 -base &4 -base &5 -base &6 -base &7
2#01000_ -64 -63 -62 -61 -60 -59 -58 -57
2#01001_ -56 -55 -54 -53 -52 -51 -50 -49
2#01010_ -48 -47 -46 -45 -44 -43 -42 -41
2#01011_ -40 -39 -38 -37 -36 -35 -34 -33
2#01100_ -32 -31 -30 -29 -28 -27 -26 -25
2#01101_ -24 -23 -22 -21 -20 -19 -18 -17
2#01110_ -16 -15 -14 -13 -12 -11 -10 -9
2#01111_ -8 -7 -6 -5 -4 -3 -2 -1
2#10000_ 0 1 2 3 4 5 6 7
2#10001_ 8 9 10 11 12 13 14 15
2#10010_ 16 17 18 19 20 21 22 23
2#10011_ 24 25 26 27 28 29 30 31
2#10100_ 32 33 34 35 36 37 38 39
2#10101_ 40 41 42 43 44 45 46 47
2#10110_ 48 49 50 51 52 53 54 55
2#10111_ 56 57 58 59 60 61 62 63
2#11000_ 64 65 66 67 68 69 70 71
2#11001_ 72 73 74 75 76 77 78 79
2#11010_ 80 81 82 83 84 85 86 87
2#11011_ 88 89 90 91 92 93 94 95
2#11100_ 96 97 98 99 100 101 102 103
2#11101_ 104 105 106 107 108 109 110 111
2#11110_ 112 113 114 115 116 117 118 119
2#11111_ 120 121 122 123 124 125 126 null

Mycelia Actors

An actor in Mycelia occupies a cache-line, which is 32 bytes, or 8x32-bit words. The first word is an executable instruction that loads the last word (offset 0x1c) as the code to execute as the actor's behavior.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     | = 0xE59CF01C
        +--------+--------+--------+--------+
  0x04  | actor state                       |
        +        .        .        .        +
  0x08  |                                   |
        +        .        .        .        +
  0x0c  |                                   |
        +        .        .        .        +
  0x10  |                                   |
        +        .        .        .        +
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

On entry to the behavior, r10 (sl) points to the sponsor, r11 (fp) points to the message-event, r12 (ip) points to the actor.

Value

An extended value, in BOSE, occupies more than one octet. The prefix carries type and encoding information. The size field immediately follows the prefix, and describes the number of octets occupied by the data following the size.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | prefix | +int_0 |  n_4   |
        +--------+--------+--------+--------+
  0x08  | size in octets (lsb first)        |
        +--------+--------+--------+--------+
  0x0c  |                                   |
        +        .        .        .        +
  0x10  |    12 octets of immediate data    |
        +        .        .        .        +
  0x14  |                                   |
        +--------+--------+--------+--------+
  0x18  | pointer to extended data block(s) |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

If the size is greater than 12 octets, the pointer (at offset 0x18) holds the memory address of the next block of data.

Extension Blocks

Extension blocks are not actors. They are simply 28 octets of data and a pointer (at offset 0x1c) to the next extension block (if any).

        +--------+--------+--------+--------+
  0x00  |                                   |
        +        .        .        .        +
  0x04  |                                   |
        +        .        .        .        +
  0x08  |    28 octets of extension data    |
        +        .        .        .        +
  0x0c  |                                   |
        +        .        .        .        +
  0x10  |     (28 octets = 7 pointers)      |
        +        .        .        .        +
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | pointer to next extension block   |
        +--------+--------+--------+--------+

Boolean and Null

The values true, false, and null are represented by static singleton actors. The prefix field is the single-octet encoding of the value.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |        | prefix |                 |
        +        +--------+        .        +
  0x08  |                                   |
        +        .        .        .        +
  0x0c  |                                   |
        +        .        .        .        +
  0x10  |                                   |
        +        .        .        .        +
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Number

Integers that are encoded in 32-bits or less can represent the entire number in a single actor. The size field is n_4 (0x84) because the 32-bit value will be encoded in 4 octets.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | int_0  |  n_4   |   --   |
        +--------+--------+--------+--------+
  0x08  |            32-bit int             |
        +--------+--------+--------+--------+
  0x0c  |              exp = 0              |
        +--------+--------+--------+--------+
  0x10  |             base = 10             |
        +--------+--------+--------+--------+
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Decimal numbers that are encoded in 32-bits per component can represent the entire number in a single actor. The size field is n_10 (0x8a) because the expr component will be encoded with a 2-octet prefix.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | dec_0  |  n_10  |   --   |
        +--------+--------+--------+--------+
  0x08  |            32-bit int             |
        +--------+--------+--------+--------+
  0x0c  |            32-bit exp             |
        +--------+--------+--------+--------+
  0x10  |             base = 10             |
        +--------+--------+--------+--------+
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Based numbers that are encoded in 32-bits per component can represent the entire number in a single actor. The size field is n_16 (0x90) because the expr and base components will be encoded with a 2-octet prefix each.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | base_0 |  n_16  |   --   |
        +--------+--------+--------+--------+
  0x08  |            32-bit int             |
        +--------+--------+--------+--------+
  0x0c  |            32-bit exp             |
        +--------+--------+--------+--------+
  0x10  |            32-bit base            |
        +--------+--------+--------+--------+
  0x14  |                                   |
        +        .        .        .        +
  0x18  |                                   |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Note that all three fixed-field formats include values for all three components, using default values for non-encoded components.

String

Strings that are encoded in 20 octets or less can represent the entire string in a single actor and avoid allocating any extension blocks.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | prefix |  size  |        |
        +--------+--------+--------+        +
  0x08  |                                   |
        +        .        .        .        +
  0x0c  |                                   |
        +        .        .        .        +
  0x10  |    20 octets of character data    |
        +        .        .        .        +
  0x14  |                                   |
        +        .        .        +--------+
  0x18  |                          |  0x00  |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

The size field will be between n_0 (0x80) and n_20 (0x94). The size field is not necessarily equal to the number of characters. The number of character depends on the encoding. For example, in UTF-16 at most 10 characters can be encoded. Note the extra '\0' (0x00) character at the end of the data, which in not technically part of the string, but may make the string easier to handle in languages like C.

Iterator

Iterating through the characters of a string involves decoding a variety of variable-length encodings. A stateful Iterator object keeps track of a location within a string, giving sequential access to unicode characters.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  | total octets remaining            |
        +--------+--------+--------+--------+
  0x08  | pointer to starting octet         |
        +--------+--------+--------+--------+
  0x0c  | pointer to ending octet (last+1)  |
        +--------+--------+--------+--------+
  0x10  |                                   |
        +        .        .        .        +
  0x14  |                                   |
        +--------+--------+--------+--------+
  0x18  | address of decode procedure       |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Builder

The complement of an Iterator is a Builder, which mutates a String by sequentially appending unicode characters. On creation, a builder allocates a new empty String.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  | pointer to String being mutated   |
        +--------+--------+--------+--------+
  0x08  | pointer to starting octet         |
        +--------+--------+--------+--------+
  0x0c  | pointer to ending octet (last+1)  |
        +--------+--------+--------+--------+
  0x10  |                                   |
        +        .        .        .        +
  0x14  |                                   |
        +--------+--------+--------+--------+
  0x18  | address of encode procedure       |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

Array

Arrays do not directly contain their elements, but rather contain pointers to each element as a separate actor/value.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | array  | +int_0 |  n_4   |
        +--------+--------+--------+--------+
  0x08  | size in octets (lsb first)        |
        +--------+--------+--------+--------+
  0x0c  |                                   |
        +        +        +        +        +
  0x10  |   pointers to first 3 elements    |
        +        +        +        +        +
  0x14  |                                   |
        +--------+--------+--------+--------+
  0x18  | pointer to extended data block(s) |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

The size field is the number of octets occupied by the element pointers, thus (size >> 2) is the number of elements. Note that each extension block holds 7 data pointers (plus a pointer to the next extension block).

Object

Objects do not directly contain their properties, but rather contain pairs of pointers for each property name and property value.

        +--------+--------+--------+--------+
  0x00  |       ldr     pc, [ip, #0x1c]     |
        +--------+--------+--------+--------+
  0x04  |   --   | object | +int_0 |  n_4   |
        +--------+--------+--------+--------+
  0x08  | size in octets (lsb first)        |
        +--------+--------+--------+--------+
  0x0c  |  pointer to first property name   |
        +        +        +        +        +
  0x10  |  pointer to first property value  |
        +        +        +        +        +
  0x14  |  pointer to second property name  |
        +--------+--------+--------+--------+
  0x18  | pointer to extended data block(s) |
        +--------+--------+--------+--------+
  0x1c  | address of actor behavior         |
        +--------+--------+--------+--------+

The size field is the number of octets occupied by the property pointers, thus (size >> 3) is the number of properties. Note that each extension block holds 7 data pointers (plus a pointer to the next extension block).