CF-8: The 8-bit Control Freak Format

Overview

CF-8 is designed so that every Unicode scalar value can be represented while both the C0 and C1 control codes stay transparent: single bytes 0x00 to 0x9F encode only U+0000 to U+009F, and multi-byte sequences use only bytes in the range 0xA0 to 0xFF. This makes it partially compatible with ISO-2022. Only code points within the Basic Multilingual Plane (BMP) can be encoded directly. Supplementary characters must therefore be encoded as surrogate pairs, with each surrogate encoded as its own BMP value.
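As a minimal sketch of the surrogate step, assuming CF-8 uses the same surrogate arithmetic as UTF-16 (the text above implies this but does not spell it out), a supplementary code point can be split into two BMP values before encoding. The function name to_surrogates is hypothetical.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a (high, low) surrogate pair.

    Assumption: standard UTF-16 surrogate arithmetic applies to CF-8.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DE00
```

Each of the two resulting values would then be encoded as an ordinary three-byte BMP value.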

Encoding Steps

Each BMP value is encoded in one to three bytes: the value is split into that many bit fields, and in a multi-byte sequence a fixed bias is added to each field. In the table below, the 0bXXXXXXXX, 0bXXXX, and 0bXXXXXX placeholders represent 8-, 4-, and 6-bit fields, respectively. The bias offsets for two- and three-byte values are shown as hexadecimal constants.

Low     High    Byte 1          Byte 2           Byte 3
U+0000  U+009F  0bXXXXXXXX
U+00A0  U+03FF  0xE0 + 0bXXXX   0xA0 + 0bXXXXXX
U+0400  U+FFFF  0xF0 + 0bXXXX   0xA0 + 0bXXXXXX  0xA0 + 0bXXXXXX
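
The table above can be sketched as a small function. This is an illustrative reading of the table, not a reference implementation; the name cf8_encode_bmp is hypothetical.

```python
def cf8_encode_bmp(cp: int) -> bytes:
    """Encode one BMP value (0x0000-0xFFFF) following the table above."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only BMP values can be encoded directly")
    if cp <= 0x9F:
        # One byte: the value itself.
        return bytes([cp])
    if cp <= 0x3FF:
        # Two bytes: 4-bit field, then 6-bit field.
        return bytes([0xE0 + (cp >> 6), 0xA0 + (cp & 0x3F)])
    # Three bytes: 4-bit field, then two 6-bit fields.
    return bytes([
        0xF0 + (cp >> 12),
        0xA0 + ((cp >> 6) & 0x3F),
        0xA0 + (cp & 0x3F),
    ])


print(cf8_encode_bmp(0x00A0).hex(" "))  # e2 c0
```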

Encoding Example

Consider the Chinese character 永 (eternity), which has the scalar value U+6C38. According to the table above, this value needs three bytes to encode. The value is therefore split into three fields of 4, 6, and 6 bits, for a total of 16 bits:

0b0110 0b110000 0b111000

The three final byte values are calculated as follows. The hexadecimal biases are the same for every three-byte sequence:

byte_1 = 0xF0 + 0b0110
byte_2 = 0xA0 + 0b110000
byte_3 = 0xA0 + 0b111000
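
Evaluating these sums gives the bytes 0xF6 0xD0 0xD8. The short script below repeats the arithmetic with shifts and masks; the variable names are only for illustration.

```python
cp = 0x6C38  # 永

# Split the 16-bit value into 4-, 6-, and 6-bit fields.
field_1 = cp >> 12           # 0b0110
field_2 = (cp >> 6) & 0x3F   # 0b110000
field_3 = cp & 0x3F          # 0b111000

# Add the fixed three-byte biases.
byte_1 = 0xF0 + field_1
byte_2 = 0xA0 + field_2
byte_3 = 0xA0 + field_3

encoded = bytes([byte_1, byte_2, byte_3])
print(encoded.hex(" "))  # f6 d0 d8
```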

Decoding

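Decoding reverses the encoding steps: the first byte determines the length of the sequence, the bias is subtracted from each byte, and the resulting bit fields are joined back into a single value.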

Decoders should reject overlong sequences. Each value must be encoded using as few bytes as possible, so a two-byte sequence that decodes to a value below U+00A0, or a three-byte sequence that decodes to a value below U+0400, should be rejected.
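
A decoder with the overlong check might look like the sketch below. It returns 16-bit values and does not combine surrogate pairs. Rejecting a stray byte in the range 0xA0 to 0xDF at the start of a sequence, and rejecting truncated sequences, are assumptions inferred from the encoding table rather than rules stated here. The name cf8_decode_units is hypothetical.

```python
def cf8_decode_units(data: bytes) -> list[int]:
    """Decode CF-8 bytes into 16-bit values (surrogates are not combined)."""

    def trail(i: int) -> int:
        # Assumption: a trailing byte must lie in 0xA0-0xDF; it carries 6 bits.
        if i >= len(data) or not 0xA0 <= data[i] <= 0xDF:
            raise ValueError(f"bad or missing trailing byte at offset {i}")
        return data[i] - 0xA0

    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x9F:              # one-byte form
            value, size, minimum = b, 1, 0x0000
        elif 0xE0 <= b <= 0xEF:    # two-byte form
            value = ((b - 0xE0) << 6) | trail(i + 1)
            size, minimum = 2, 0x00A0
        elif b >= 0xF0:            # three-byte form
            value = ((b - 0xF0) << 12) | (trail(i + 1) << 6) | trail(i + 2)
            size, minimum = 3, 0x0400
        else:                      # assumption: 0xA0-0xDF cannot start a sequence
            raise ValueError(f"unexpected trailing byte at offset {i}")
        if value < minimum:
            # The value would have fit in a shorter form.
            raise ValueError(f"overlong sequence at offset {i}")
        units.append(value)
        i += size
    return units


print([hex(u) for u in cf8_decode_units(b"\xf6\xd0\xd8")])  # ['0x6c38']
```

For example, the bytes 0xE0 0xA1 decode to the value 1, which fits in a single byte, so the sketch raises ValueError for them.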

Byte Order Mark

CF-8’s byte order mark is: 0xFF 0xDB 0xDF.
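
These three bytes are what the three-byte form in the encoding table produces for U+FEFF (fields 0b1111, 0b111011, and 0b111111). A small sketch of detecting and stripping the mark follows; the names CF8_BOM and strip_bom are hypothetical.

```python
CF8_BOM = bytes([0xFF, 0xDB, 0xDF])


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading CF-8 byte order mark, if present."""
    if data.startswith(CF8_BOM):
        return data[len(CF8_BOM):]
    return data


# Sanity check: the mark matches the three-byte encoding of U+FEFF.
bom = 0xFEFF
assert CF8_BOM == bytes([
    0xF0 + (bom >> 12),
    0xA0 + ((bom >> 6) & 0x3F),
    0xA0 + (bom & 0x3F),
])

print(strip_bom(b"\xff\xdb\xdfhello"))  # b'hello'
```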

Comparison

The table below compares the number of bytes CF-8 requires for different scalar values with the number required by other encodings.

Low       High      CF-8  UTF-8  UTF-16  UTF-32
U+0000    U+007F    1     1      2       4
U+0080    U+009F    1     2      2       4
U+00A0    U+03FF    2     2      2       4
U+0400    U+07FF    3     2      2       4
U+0800    U+FFFF    3     3      2       4
U+010000  U+10FFFF  6     4      4       4
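
The rows above can be spot-checked with a short script. The CF-8 lengths come from a hypothetical helper, cf8_length, that follows the encoding table; the other lengths come from Python's built-in codecs, using the big-endian variants so that no byte order mark is counted.

```python
def cf8_length(cp: int) -> int:
    """Number of CF-8 bytes for one scalar value, per the encoding table."""
    if cp <= 0x9F:
        return 1
    if cp <= 0x3FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 6  # two surrogates, three bytes each


# One sample value from each row of the table.
for cp in (0x41, 0x85, 0x3A9, 0x416, 0x6C38, 0x1F600):
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        cf8_length(cp),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )
```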

Versus UTF-8

CF-8 has the following similarities to UTF-8:

And it has the following differences:

Versus ISO-2022

CF-8 follows ISO-2022 in the following ways:

And it deviates in the following ways: