The idea here is to allow all Unicode scalar values to be encoded such that both C0 and C1 control codes are perfectly preserved: each one remains a single byte with its original value. It is partially compliant with ISO-2022.
CF-8 maps Unicode scalar values to byte values in the following fashion:
Low | High | Byte 1 | Byte 2 | Byte 3 |
---|---|---|---|---|
U+0000 | U+009F | 0x00 – 0x9F | | |
U+00A0 | U+03FF | 0xE0 – 0xEF | 0xA0 – 0xDF | |
U+0400 | U+FFFF | 0xF0 – 0xFF | 0xA0 – 0xDF | 0xA0 – 0xDF |
As only the Basic Multilingual Plane is supported, surrogate pairs must be used to encode supplementary characters.
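The mapping table and the surrogate-pair rule above can be sketched as a small encoder. This is only an illustration under stated assumptions: the function name `cf8_encode` is hypothetical, and the surrogate split is assumed to follow the usual UTF-16 arithmetic, with each surrogate then encoded as its own three-byte sequence.

```python
def cf8_encode(text: str) -> bytes:
    """Encode a string as CF-8, following the mapping table (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary character: split into a surrogate pair
            # (assumed to use the same arithmetic as UTF-16).
            v = cp - 0x10000
            units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units = [cp]
        for u in units:
            if u <= 0x9F:
                # One byte: ASCII, C0 and C1 controls keep their own value.
                out.append(u)
            elif u <= 0x3FF:
                # Two bytes: 4-bit field, then 6-bit field.
                out.append(0xE0 + (u >> 6))
                out.append(0xA0 + (u & 0x3F))
            else:
                # Three bytes: 4-bit, 6-bit, and 6-bit fields.
                out.append(0xF0 + (u >> 12))
                out.append(0xA0 + ((u >> 6) & 0x3F))
                out.append(0xA0 + (u & 0x3F))
    return bytes(out)


print(cf8_encode("A\u0085\u00e9").hex(" "))  # 41 85 e3 c9
```

Note that every byte of a multi-byte sequence falls in 0xA0 – 0xFF, so an encoded stream never produces a byte in the control ranges unless the text itself contains that control code.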
As an example, consider the Chinese character 永 (eternity), which has a scalar value of U+6C38. According to the table above, this value needs three bytes to encode. Accordingly, the value is split into three fields of 4, 6, and 6 bits, giving a total of 16 bits:
0b0110 0b110000 0b111000
The three final byte values are calculated as follows; the hexadecimal offsets are the same for every three-byte sequence:
byte_1 = 0xF0 + 0b0110 = 0xF6
byte_2 = 0xA0 + 0b110000 = 0xD0
byte_3 = 0xA0 + 0b111000 = 0xD8
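The same arithmetic can be reproduced with shifts and masks. The snippet below is a minimal sketch of the worked example only; the variable names are illustrative and not part of CF-8 itself.

```python
scalar = 0x6C38  # 永

# Split the 16-bit value into 4-bit, 6-bit, and 6-bit fields.
field_1 = (scalar >> 12) & 0xF
field_2 = (scalar >> 6) & 0x3F
field_3 = scalar & 0x3F

# Add the constant offsets used by every three-byte sequence.
encoded = bytes([0xF0 + field_1, 0xA0 + field_2, 0xA0 + field_3])
print(encoded.hex(" "))  # f6 d0 d8
```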
CF-8’s byte order mark is: 0xFF 0xDB 0xDF.
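One way to check the byte order mark is to reverse the three-byte mapping and confirm that the sequence decodes to U+FEFF. The helper below is a hypothetical sketch (the name `cf8_decode_three` is an assumption); it handles only three-byte sequences and performs only range validation.

```python
def cf8_decode_three(b1: int, b2: int, b3: int) -> int:
    """Decode one three-byte CF-8 sequence into a 16-bit value (sketch)."""
    if not (0xF0 <= b1 <= 0xFF and 0xA0 <= b2 <= 0xDF and 0xA0 <= b3 <= 0xDF):
        raise ValueError("not a three-byte CF-8 sequence")
    # Remove the constant offsets and reassemble the 4-, 6-, and 6-bit fields.
    return ((b1 - 0xF0) << 12) | ((b2 - 0xA0) << 6) | (b3 - 0xA0)


bom = cf8_decode_three(0xFF, 0xDB, 0xDF)
print(f"U+{bom:04X}")  # U+FEFF
```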
The table below compares the number of bytes CF-8 requires for different scalar values with the number required by other encodings.
Low | High | CF-8 | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|---|---|
U+0000 | U+007F | 1 | 1 | 2 | 4 |
U+0080 | U+009F | 1 | 2 | 2 | 4 |
U+00A0 | U+03FF | 2 | 2 | 2 | 4 |
U+0400 | U+07FF | 3 | 2 | 2 | 4 |
U+0800 | U+FFFF | 3 | 3 | 2 | 4 |
U+010000 | U+10FFFF | 6 | 4 | 4 | 4 |
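The rows of this table can be spot-checked with a short script. This is a sketch under assumptions: `cf8_length` is a hypothetical helper derived from the CF-8 mapping table, while the UTF-8, UTF-16, and UTF-32 counts come from Python's built-in codecs (the `-le` variants are used so that no byte order mark is counted).

```python
def cf8_length(cp: int) -> int:
    """Number of CF-8 bytes for a scalar value, per the mapping table (sketch)."""
    if cp <= 0x9F:
        return 1
    if cp <= 0x3FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 6  # surrogate pair: two three-byte sequences


def byte_counts(cp: int) -> tuple:
    """Return (CF-8, UTF-8, UTF-16, UTF-32) byte counts for a non-surrogate scalar."""
    ch = chr(cp)
    return (
        cf8_length(cp),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for cp in (0x41, 0x85, 0x3FF, 0x400, 0x6C38, 0x1F600):
    print(f"U+{cp:04X}", byte_counts(cp))
```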
CF-8 has the following similarities to UTF-8:
And it has the following differences:
0xFE and 0xFF bytes may appear in encoded streams.
CF-8 follows ISO-2022 in the following ways:
And it deviates in the following ways: