STF-7: The Silly 7-bit Transformation Format

Premise

Knowing that:

  1. The zombie apocalypse could come upon us at any moment,
  2. This will leave us with nothing except 7-bit teletype machines for communication,
  3. We will still need to send emojis to each other,

I have taken it upon myself to create a 7-bit encoding scheme for Unicode.

Features

This encoding has a number of advantages over UTF-7:

Note that STF-7 still shares some of UTF-7's security concerns. See the dedicated section below for more information.

Encoding Steps

Some Unicode characters are encoded directly as a single ASCII value. All others are encoded indirectly as a sequence of two to six ASCII values.

Direct Encoding

The following character ranges are encoded directly by making the encoded ASCII value the same as the Unicode scalar value.

Low High Description
U+00 U+20 Control codes and space
U+30 U+39 Numbers
U+41 U+5A Latin uppercase
U+61 U+7A Latin lowercase
U+7F U+7F Delete

Directly encoded characters can be interpreted as-is by an ASCII system without STF-7 support.
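
As a rough illustration, the direct-encoding table can be turned into a membership test. This is only a sketch in Python; the names DIRECT_RANGES and is_direct are made up for this example and are not part of STF-7 itself.

```python
# Inclusive ranges copied from the direct-encoding table.
# The names below are illustrative only; STF-7 does not define them.
DIRECT_RANGES = [
    (0x00, 0x20),  # control codes and space
    (0x30, 0x39),  # numbers
    (0x41, 0x5A),  # Latin uppercase
    (0x61, 0x7A),  # Latin lowercase
    (0x7F, 0x7F),  # delete
]


def is_direct(code_point: int) -> bool:
    """Return True if the scalar value is emitted as one identical ASCII value."""
    return any(low <= code_point <= high for low, high in DIRECT_RANGES)
```

For instance, is_direct(ord("A")) is true, while is_direct(ord("<")) is false because U+3C falls in one of the gaps that indirect encoding uses.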

Indirect Encoding

Any Unicode character not directly encoded is indirectly encoded. In this case, the scalar value is split into chunks of four bits.

Low High Chunks
U+21 U+2F 2
U+3A U+40 2
U+5B U+60 2
U+7B U+7E 2
U+80 U+FF 2
U+100 U+FFF 3
U+1000 U+FFFF 4
U+10000 U+FFFFF 5
U+100000 U+10FFFF 6

An indirectly encoded character produces one ASCII value per four-bit chunk of its scalar value. Use the chunk count given in the table above, padding the most significant chunk with leading zero bits where necessary. Each chunk therefore has a value between 0x0 and 0xF.
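
The chunk counts in the table follow from the bit length of the scalar value, rounded up to a multiple of four. A minimal Python sketch, assuming the caller has already established that the character is indirectly encoded (split_chunks is a hypothetical helper name):

```python
def split_chunks(code_point: int) -> list[int]:
    """Split a scalar value into four-bit chunks, most significant first.

    Intended for indirectly encoded characters only (U+21 and above),
    which always yield between two and six chunks.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    # Round the bit length up to a whole number of four-bit chunks.
    count = (code_point.bit_length() + 3) // 4
    # Shift each chunk down to the low four bits and mask it off.
    return [(code_point >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]
```

For U+3C this gives the chunks 0x3 and 0xC, and for U+1F600 it gives 0x1, 0xF, 0x6, 0x0, 0x0.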

Initial Chunks

Start with the most significant chunk and work through every chunk except the last. For each one, output a single ASCII value in place of its four-bit value according to the table below:

Chunk Value ASCII Value
0x0 0x21
0x1 0x22
0x2 0x23
0x3 0x24
0x4 0x25
0x5 0x26
0x6 0x27
0x7 0x28
0x8 0x29
0x9 0x2A
0xA 0x2B
0xB 0x2C
0xC 0x2D
0xD 0x2E
0xE 0x2F
0xF 0x3A

Closing Chunk

For a Unicode value's final chunk, use the table below instead:

Chunk Value ASCII Value
0x0 0x3B
0x1 0x3C
0x2 0x3D
0x3 0x3E
0x4 0x3F
0x5 0x40
0x6 0x5B
0x7 0x5C
0x8 0x5D
0x9 0x5E
0xA 0x5F
0xB 0x60
0xC 0x7B
0xD 0x7C
0xE 0x7D
0xF 0x7E

After this, the next Unicode value can be encoded.
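
Putting the direct ranges and the two chunk tables together, an encoder could look roughly like the following. This is a sketch rather than a reference implementation: the names INITIAL, CLOSING, is_direct and encode are invented here, and rejecting lone surrogates is an assumption based on the spec talking about scalar values.

```python
# Lookup tables indexed by chunk value (0x0..0xF), built from the two tables.
INITIAL = bytes(range(0x21, 0x30)) + b"\x3a"
CLOSING = bytes([*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)])


def is_direct(cp: int) -> bool:
    """True for the directly encoded ranges."""
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def encode(text: str) -> bytes:
    """Encode a string as STF-7 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Assumption: lone surrogates are not scalar values, so refuse them.
            raise ValueError("surrogates are not Unicode scalar values")
        if is_direct(cp):
            out.append(cp)
            continue
        count = (cp.bit_length() + 3) // 4
        chunks = [(cp >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]
        out.extend(INITIAL[c] for c in chunks[:-1])  # every chunk but the last
        out.append(CLOSING[chunks[-1]])              # final chunk
    return bytes(out)
```

With this sketch, encode("<script>") yields b"${script$}", since U+3C splits into 0x3 and 0xC, which map to 0x24 ($) and 0x7B ({).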

Decoding

Decoding performs the encoding steps in reverse.

Decoders should enforce these two specific rules:

  1. Any character that can be directly encoded should be directly encoded.
  2. Indirectly encoded characters should be encoded using as few bytes as possible.

Any encoded sequence that violates either rule should be rejected.
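
One way a decoder might enforce both rules is sketched below in Python. The names are hypothetical, and the handling of truncated input, surrogates and out-of-range values is an assumption that goes beyond what is written here. Note that a leading chunk of zero always signals a violation: the sequence is either longer than necessary or decodes to a value below U+10, which must be directly encoded.

```python
INITIAL = bytes(range(0x21, 0x30)) + b"\x3a"
CLOSING = bytes([*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)])


def is_direct(cp: int) -> bool:
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def decode(data: bytes) -> str:
    """Decode STF-7 bytes, rejecting non-canonical sequences (sketch)."""
    out = []
    value = 0    # scalar value accumulated so far
    pending = 0  # initial chunks consumed for the current character
    for byte in data:
        if byte in INITIAL:
            if pending == 0 and byte == INITIAL[0]:
                raise ValueError("leading zero chunk: not the shortest form")
            if pending == 5:
                raise ValueError("more than six chunks")
            value = (value << 4) | INITIAL.index(byte)
            pending += 1
        elif byte in CLOSING:
            if pending == 0:
                raise ValueError("closing chunk without an initial chunk")
            value = (value << 4) | CLOSING.index(byte)
            if is_direct(value):
                raise ValueError("character should have been directly encoded")
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out.append(chr(value))
            value, pending = 0, 0
        elif byte <= 0x7F:
            # Every remaining 7-bit value is a directly encoded character.
            if pending:
                raise ValueError("direct value inside an indirect sequence")
            out.append(chr(byte))
        else:
            raise ValueError("not a 7-bit value")
    if pending:
        raise ValueError("truncated indirect sequence")
    return "".join(out)
```

Under these assumptions, decode(b"${script$}") returns "<script>", while b"%<" (an indirect spelling of "A") and b"!${" (an overlong spelling of "<") both raise ValueError.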

When seeking to an arbitrary position, resume decoding immediately only if the value at that position is directly encoded. Otherwise, skip past the next closing chunk and resume decoding there.
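
The seeking rule can be sketched as a small resynchronisation helper. The function name resync and the choice to return the end of the buffer when no closing chunk is found are assumptions made for this example.

```python
CLOSING = bytes([*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)])


def is_direct(cp: int) -> bool:
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def resync(data: bytes, offset: int) -> int:
    """Return the first index at or after `offset` where decoding may resume."""
    if offset >= len(data):
        return len(data)
    if is_direct(data[offset]):
        # A directly encoded value is a complete character on its own.
        return offset
    # Otherwise we may be in the middle of an indirect sequence:
    # skip to just past the next closing chunk.
    for i in range(offset, len(data)):
        if data[i] in CLOSING:
            return i + 1
    return len(data)  # assumption: no closing chunk means nothing left to decode
```

For the buffer b"ab${cd", seeking to index 2 or 3 lands inside the two-value sequence for "<", so resync returns 4 in both cases.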

Byte Order Mark

STF-7's byte order mark is: 0x3A 0x2F 0x3A 0x7E.
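
These four values are what the indirect encoding produces for U+FEFF, the usual Unicode byte order mark. A short Python check, with table names invented for this example:

```python
INITIAL = bytes(range(0x21, 0x30)) + b"\x3a"
CLOSING = bytes([*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)])

cp = 0xFEFF
# U+FEFF needs four chunks: 0xF, 0xE, 0xF, 0xF.
chunks = [(cp >> shift) & 0xF for shift in (12, 8, 4, 0)]
# Three initial chunks followed by one closing chunk.
bom = bytes([INITIAL[c] for c in chunks[:-1]] + [CLOSING[chunks[-1]]])
```

Here bom comes out as the byte sequence 0x3A 0x2F 0x3A 0x7E.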

Security Concerns

As STF-7 does not directly encode the entire ASCII range, certain single ASCII values are represented using pairs of ASCII values in the encoded output.

Consider, for instance, the hypothetical case of filtering JavaScript out of HTML by searching the encoded text for any occurrence of <script>. This typically works reliably under most code pages because the basic ASCII range (0x00-0x7F) is encoded as-is. With STF-7, however, <script> is encoded as ${script$} and therefore evades the filter. The encoded text can still be decoded later in the process and executed with every script block left intact.
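
The problem can be reproduced with a few lines of Python. The encoder below is a compact sketch with made-up names, and naive_filter_passes stands in for whatever byte-level filter a real system might use.

```python
INITIAL = bytes(range(0x21, 0x30)) + b"\x3a"
CLOSING = bytes([*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)])


def is_direct(cp: int) -> bool:
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def encode(text: str) -> bytes:
    """Compact STF-7 encoder sketch (no surrogate checks)."""
    out = bytearray()
    for cp in map(ord, text):
        if is_direct(cp):
            out.append(cp)
            continue
        count = (cp.bit_length() + 3) // 4
        chunks = [(cp >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]
        out.extend(INITIAL[c] for c in chunks[:-1])
        out.append(CLOSING[chunks[-1]])
    return bytes(out)


def naive_filter_passes(data: bytes) -> bool:
    """Hypothetical filter that only looks for the literal ASCII tag."""
    return b"<script>" not in data


plain = "<script>".encode("ascii")
encoded = encode("<script>")
```

The filter rejects plain but lets encoded through, even though both carry the same text once decoded.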

Therefore, STF-7 should only be implemented as a part of a tightly-controlled system. Encoding should happen immediately before entering a 7-bit transmission medium and decoding should happen immediately after exiting this medium.

Comparison

The table below compares the number of bytes STF-7 requires for different scalar values against other encodings. UTF-7 is not listed as it is rather freeform and arbitrary.

Low High STF-7 UTF-8 UTF-16 UTF-32
U+00 U+7F 1|2 1 2 4
U+80 U+FF 2 2 2 4
U+100 U+7FF 3 2 2 4
U+800 U+FFF 3 3 2 4
U+1000 U+FFFF 4 3 2 4
U+10000 U+FFFFF 5 4 4 4
U+100000 U+10FFFF 6 4 4 4
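
The figures in the table can be spot-checked with Python's built-in codecs. The STF-7 length is computed from the rules in this document, and the helper names stf7_length and lengths are assumptions for this sketch.

```python
def is_direct(cp: int) -> bool:
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def stf7_length(cp: int) -> int:
    """Number of ASCII values STF-7 needs for one scalar value."""
    return 1 if is_direct(cp) else (cp.bit_length() + 3) // 4


def lengths(cp: int) -> tuple[int, int, int, int]:
    """Byte counts as (STF-7, UTF-8, UTF-16, UTF-32), without byte order marks."""
    ch = chr(cp)
    return (
        stf7_length(cp),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
```

For example, lengths(0x41) gives (1, 1, 2, 4) and lengths(0x3C) gives (2, 1, 2, 4), which are the two cases behind the 1|2 entry in the first row.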

Copyright

To the extent possible under law, William Swartzendruber has waived all copyright and related or neighboring rights to STF-7: The Silly 7-bit Transformation Format.

This work is published from: United States of America

CC0