CF-8: The 8-bit Control Freak Format

Premise

The goal is to encode all Unicode scalar values while preserving both C0 and C1 control codes exactly, each as a single byte equal to its code point. The format is partially compliant with ISO-2022.

Approach

CF-8 maps Unicode scalar values to byte values in the following fashion:

Low      High     Byte 1      Byte 2      Byte 3
U+0000   U+009F   0x00–0x9F
U+00A0   U+03FF   0xE0–0xEF   0xA0–0xDF
U+0400   U+FFFF   0xF0–0xFF   0xA0–0xDF   0xA0–0xDF
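
The mapping table can be sketched as a small encoder for a single 16-bit value. The function name `cf8_encode_unit` is hypothetical. The 4-bit plus 6-bit split used for the two-byte range is an assumption made by analogy with the three-byte case, which is the only one the text spells out.

```python
def cf8_encode_unit(value: int) -> bytes:
    """Encode one 16-bit value (U+0000..U+FFFF) following the mapping table."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value is outside the 16-bit range covered by the table")
    if value <= 0x9F:
        # C0 controls, ASCII, and C1 controls pass through as a single byte.
        return bytes([value])
    if value <= 0x3FF:
        # Assumed layout: upper 4 bits in the lead byte, lower 6 bits in the trail byte.
        return bytes([0xE0 + (value >> 6), 0xA0 + (value & 0x3F)])
    # Three-byte layout: 4 bits, 6 bits, 6 bits.
    return bytes([
        0xF0 + (value >> 12),
        0xA0 + ((value >> 6) & 0x3F),
        0xA0 + (value & 0x3F),
    ])
```

With this layout every lead byte falls in 0xE0–0xFF and every trail byte in 0xA0–0xDF, so neither can collide with the single-byte range 0x00–0x9F.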

Only the Basic Multilingual Plane is encoded directly, so supplementary characters must be encoded as surrogate pairs, with each surrogate taking three bytes.
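
A minimal sketch of the surrogate split, assuming CF-8 uses the same high/low surrogate arithmetic as UTF-16 (the text does not say otherwise). The function name `to_surrogates` is hypothetical.

```python
def to_surrogates(scalar: int) -> tuple:
    """Split a supplementary scalar value into a (high, low) surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = scalar - 0x10000          # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DE00
```

Both surrogates lie in the U+0400 to U+FFFF range, so each one is then encoded as three bytes, for six bytes in total.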

Encoding Example

Consider the Chinese character 永 (eternity), whose scalar value is U+6C38. According to the table above, this value needs three bytes. Its 16 bits are therefore split into three fields of 4, 6, and 6 bits:

0b0110 0b110000 0b111000

Each final byte value is the sum of a fixed base and one field. The hexadecimal bases are the same for every three-byte sequence:

byte_1 = 0xF0 + 0b0110   = 0xF6
byte_2 = 0xA0 + 0b110000 = 0xD0
byte_3 = 0xA0 + 0b111000 = 0xD8
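
The worked example can be checked mechanically. The snippet below is only a sketch with illustrative variable names. The final step, which reverses the arithmetic to recover the scalar value, is an assumption about how decoding would work, since the text describes encoding only.

```python
value = ord("永")  # 0x6C38

field_1 = value >> 12           # top 4 bits
field_2 = (value >> 6) & 0x3F   # middle 6 bits
field_3 = value & 0x3F          # low 6 bits
print(f"{field_1:04b} {field_2:06b} {field_3:06b}")  # 0110 110000 111000

encoded = bytes([0xF0 + field_1, 0xA0 + field_2, 0xA0 + field_3])
print(encoded.hex(" "))  # f6 d0 d8

# Assumed inverse: subtract the bases and reassemble the fields.
decoded = ((encoded[0] - 0xF0) << 12) | ((encoded[1] - 0xA0) << 6) | (encoded[2] - 0xA0)
print(f"U+{decoded:04X}")  # U+6C38
```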

Byte Order Mark

CF-8’s byte order mark is: 0xFF 0xDB 0xDF.
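
These three bytes match what the three-byte rule produces for U+FEFF, the code point conventionally used as a byte order mark. The sketch below derives them and strips a leading mark from a byte string. `strip_bom` is a hypothetical helper, not part of any CF-8 specification.

```python
BOM_SCALAR = 0xFEFF

# Apply the three-byte rule (4 + 6 + 6 bits) to U+FEFF.
CF8_BOM = bytes([
    0xF0 + (BOM_SCALAR >> 12),
    0xA0 + ((BOM_SCALAR >> 6) & 0x3F),
    0xA0 + (BOM_SCALAR & 0x3F),
])
print(CF8_BOM.hex(" "))  # ff db df


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading CF-8 byte order mark, if one is present."""
    return data[len(CF8_BOM):] if data.startswith(CF8_BOM) else data
```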

Comparison

The table below shows how many bytes CF-8 requires for each range of scalar values, alongside the other Unicode encodings.

Low        High       CF-8   UTF-8   UTF-16   UTF-32
U+0000     U+007F     1      1       2        4
U+0080     U+009F     1      2       2        4
U+00A0     U+03FF     2      2       2        4
U+0400     U+07FF     3      2       2        4
U+0800     U+FFFF     3      3       2        4
U+010000   U+10FFFF   6      4       4        4
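
The rows above can be reproduced with a short script. The CF-8 lengths come from the mapping table, while the other three columns use Python's built-in codecs. The function names `cf8_length` and `byte_counts` are hypothetical.

```python
def cf8_length(scalar: int) -> int:
    """Number of bytes CF-8 needs for one scalar value, per the mapping table."""
    if scalar <= 0x9F:
        return 1
    if scalar <= 0x3FF:
        return 2
    if scalar <= 0xFFFF:
        return 3
    return 6  # surrogate pair: two three-byte sequences


def byte_counts(scalar: int) -> tuple:
    """Return (CF-8, UTF-8, UTF-16, UTF-32) byte counts for one scalar value."""
    ch = chr(scalar)
    return (
        cf8_length(scalar),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # the -le variants avoid counting a BOM
        len(ch.encode("utf-32-le")),
    )


for scalar in (0x7F, 0x9F, 0x3FF, 0x7FF, 0xFFFF, 0x10FFFF):
    print(f"U+{scalar:04X}", *byte_counts(scalar))
```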

Versus UTF-8

CF-8 has the following similarities to UTF-8:

And it has the following differences:

Versus ISO-2022

CF-8 follows ISO-2022 in the following ways:

And it deviates in the following ways:

Public Domain