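The idea here is to encode all Unicode scalar values in such a way that both C0 and C1 control codes are transparent: bytes 0x00 through 0x9F only ever stand for themselves, and multi-byte sequences are built exclusively from bytes 0xA0 through 0xFF. CF-8 is therefore somewhat compliant with ISO-2022. Only code points within the Basic Multilingual Plane (BMP) can be encoded directly. Supplementary characters must instead be written as UTF-16-style surrogate pairs, with each surrogate encoded as its own three-byte sequence.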
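The surrogate split itself is not spelled out here, so the following is only a minimal sketch that assumes the standard UTF-16 computation. The helper name `to_surrogates` is hypothetical and not part of CF-8.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16-style (high, low) surrogate pair. Hypothetical helper."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000             # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)    # upper 10 bits
    low = 0xDC00 + (v & 0x3FF)   # lower 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DE00
```

Each of the two resulting BMP code points would then be encoded separately.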
All BMP scalar values are encoded using anywhere from one to three bytes by splitting the value into that many bit fields and then adding the appropriate bias for each byte, if applicable. The 0bXXXXXXXX, 0bXXXX, and 0bXXXXXX placeholders below represent 8-, 4-, and 6-bit values, respectively. Bias offsets (for two- and three-byte values) are fixed and shown as hexadecimal constants.
| Low | High | Byte 1 | Byte 2 | Byte 3 |
|---|---|---|---|---|
| U+0000 | U+009F | 0bXXXXXXXX | | |
| U+00A0 | U+03FF | 0xE0 + 0bXXXX | 0xA0 + 0bXXXXXX | |
| U+0400 | U+FFFF | 0xF0 + 0bXXXX | 0xA0 + 0bXXXXXX | 0xA0 + 0bXXXXXX |
As an example, consider the Chinese character 永 (eternity), which has the scalar value U+6C38. According to the table above, this value needs three bytes to encode. The value is therefore split into three separate fields of 4, 6, and 6 bits, giving a total of 16 bits:
0b0110 0b110000 0b111000
The three final byte values are calculated like so, with the hexadecimal values being constant in all three-byte cases:
byte_1 = 0xF0 + 0b0110
byte_2 = 0xA0 + 0b110000
byte_3 = 0xA0 + 0b111000
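The encoding steps can be sketched in a few lines of Python. This is an illustration of the table rather than a reference implementation, and the function name `cf8_encode_bmp` is hypothetical.

```python
def cf8_encode_bmp(cp: int) -> bytes:
    """Encode one BMP code point (U+0000..U+FFFF) following the table.
    Hypothetical helper; supplementary characters are not handled here."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only BMP code points can be encoded directly")
    if cp <= 0x9F:
        # One byte: the value itself, no bias.
        return bytes([cp])
    if cp <= 0x3FF:
        # Two bytes: 4-bit field + 6-bit field.
        return bytes([0xE0 + (cp >> 6), 0xA0 + (cp & 0x3F)])
    # Three bytes: 4-bit field + two 6-bit fields.
    return bytes([
        0xF0 + (cp >> 12),
        0xA0 + ((cp >> 6) & 0x3F),
        0xA0 + (cp & 0x3F),
    ])


print(cf8_encode_bmp(0x6C38).hex(" "))  # f6 d0 d8
```

For 永 this gives the bytes 0xF6 0xD0 0xD8, matching the three sums worked out by hand.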
The decoding steps are performed in reverse.
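A minimal decoding sketch, assuming the lead byte alone determines the sequence length (0x00 to 0x9F for one byte, 0xE0 to 0xEF for two, 0xF0 to 0xFF for three) as the encoding table implies. The name `cf8_decode_one` is hypothetical, and the sketch checks only byte ranges, not overlong forms.

```python
def cf8_decode_one(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one sequence starting at pos.
    Returns (code_point, bytes_consumed). Hypothetical helper."""
    lead = data[pos]
    if lead <= 0x9F:
        return lead, 1                      # single byte, no bias
    if lead <= 0xDF:
        raise ValueError("trailing byte where a lead byte was expected")
    n = 2 if lead <= 0xEF else 3            # 0xE0..0xEF: two bytes, 0xF0..0xFF: three
    tail = data[pos + 1 : pos + n]
    if len(tail) != n - 1 or any(not 0xA0 <= b <= 0xDF for b in tail):
        raise ValueError("truncated or malformed sequence")
    cp = lead & 0x0F                        # strip the 0xE0 / 0xF0 bias
    for b in tail:
        cp = (cp << 6) | (b - 0xA0)         # strip the 0xA0 bias, append 6 bits
    return cp, n


cp, used = cf8_decode_one(bytes([0xF6, 0xD0, 0xD8]))
print(f"U+{cp:04X}", used)  # U+6C38 3
```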
Decoders should reject overlong sequences. Each scalar value should be encoded using as few bytes as possible, so any sequence that decodes to a value representable in fewer bytes should be rejected.
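One way to apply this rule is to compare the length of the sequence that was actually read with the shortest length the encoding table allows for the decoded value. The sketch below assumes the decoder already knows both numbers; the names `cf8_min_length` and `reject_overlong` are hypothetical.

```python
def cf8_min_length(cp: int) -> int:
    """Shortest CF-8 sequence length for a BMP code point."""
    if cp <= 0x9F:
        return 1
    if cp <= 0x3FF:
        return 2
    return 3


def reject_overlong(cp: int, length: int) -> None:
    """Raise ValueError if cp was read from a longer sequence than necessary."""
    if length > cf8_min_length(cp):
        raise ValueError(f"overlong encoding of U+{cp:04X} ({length} bytes)")


reject_overlong(0x6C38, 3)       # fine: three bytes are required
try:
    reject_overlong(0x0000, 2)   # e.g. the bytes 0xE0 0xA0 decode to U+0000
except ValueError as err:
    print(err)
```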
CF-8’s byte order mark is: 0xFF 0xDB 0xDF.
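These three bytes can be checked by hand, assuming the byte order mark is U+FEFF as in the UTF encodings and applying the three-byte row of the encoding table:

```python
BOM = 0xFEFF  # assumed: the usual Unicode byte order mark

cf8_bom = bytes([
    0xF0 + (BOM >> 12),           # 0xF0 + 0b1111   = 0xFF
    0xA0 + ((BOM >> 6) & 0x3F),   # 0xA0 + 0b111011 = 0xDB
    0xA0 + (BOM & 0x3F),          # 0xA0 + 0b111111 = 0xDF
])

print(cf8_bom.hex(" "))  # ff db df
```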
The table below compares the number of bytes CF-8 requires for different scalar values with the number required by other encodings.
| Low | High | CF-8 | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|---|---|
| U+0000 | U+007F | 1 | 1 | 2 | 4 |
| U+0080 | U+009F | 1 | 2 | 2 | 4 |
| U+00A0 | U+03FF | 2 | 2 | 2 | 4 |
| U+0400 | U+07FF | 3 | 2 | 2 | 4 |
| U+0800 | U+FFFF | 3 | 3 | 2 | 4 |
| U+010000 | U+10FFFF | 6 | 4 | 4 | 4 |
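The rows above can be spot-checked with a short script. The UTF columns come from Python's built-in codecs, while `cf8_length` is a hypothetical helper that simply restates the CF-8 column, counting a supplementary character as two three-byte surrogates.

```python
def cf8_length(cp: int) -> int:
    """Bytes CF-8 needs for one scalar value (hypothetical helper)."""
    if cp <= 0x9F:
        return 1
    if cp <= 0x3FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 6  # surrogate pair: two three-byte sequences


def sizes(cp: int) -> tuple[int, int, int, int]:
    """(CF-8, UTF-8, UTF-16, UTF-32) byte counts for one scalar value."""
    ch = chr(cp)
    return (
        cf8_length(cp),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # -le variants avoid counting a BOM
        len(ch.encode("utf-32-le")),
    )


# One sample value from each row of the table.
for cp in (0x41, 0x85, 0x3A9, 0x416, 0x6C38, 0x1F600):
    print(f"U+{cp:04X}", *sizes(cp))
```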
CF-8 has the following similarities to UTF-8:
And it has the following differences:
- 0xFE and 0xFF bytes may appear in encoded streams.

CF-8 follows ISO-2022 in the following ways:
And it deviates in the following ways: