STF-7: The Silly 7-bit Transformation Format

Premise

Knowing that:

  1. The zombie apocalypse could come upon us at any moment,
  2. This will leave us with nothing except 7-bit teletype machines for communication,
  3. We will still need to send emojis to each other,

I have taken it upon myself to create a 7-bit encoding scheme for Unicode.

Features

This encoding has a number of advantages over UTF-7:

Encoding Steps

Some Unicode characters are encoded directly as single ASCII values. Others are encoded indirectly as a sequence of two to six ASCII values.

Direct Encoding

The following character ranges are encoded directly by making the encoded ASCII value the same as the Unicode scalar value.

Low     High    Description
U+00    U+20    Control codes and space
U+30    U+39    Numbers
U+41    U+5A    Latin uppercase
U+61    U+7A    Latin lowercase
U+7F    U+7F    Delete

Directly encoded characters can be interpreted as-is by an ASCII system without STF-7 support.
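The table above can be expressed as a small range check. This is a minimal sketch in Python; the names DIRECT_RANGES and is_direct are illustrative and not part of STF-7.

```python
# Inclusive ranges of scalar values that STF-7 passes through unchanged,
# copied from the direct-encoding table.
DIRECT_RANGES = [
    (0x00, 0x20),  # control codes and space
    (0x30, 0x39),  # numbers
    (0x41, 0x5A),  # Latin uppercase
    (0x61, 0x7A),  # Latin lowercase
    (0x7F, 0x7F),  # delete
]


def is_direct(code_point: int) -> bool:
    """Return True if the scalar value is encoded as the same ASCII value."""
    return any(low <= code_point <= high for low, high in DIRECT_RANGES)
```

Under this sketch, is_direct(ord("a")) is true, while is_direct(ord("!")) is false because U+21 falls outside every range.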

Indirect Encoding

Any Unicode character not directly encoded is indirectly encoded. In this case, the scalar value is split into chunks of four bits.

Low         High        Number of ASCII values
U+21        U+2F        2
U+3A        U+40        2
U+5B        U+60        2
U+7B        U+7E        2
U+80        U+FF        2
U+100       U+FFF       3
U+1000      U+FFFF      4
U+10000     U+FFFFF     5
U+100000    U+10FFFF    6

The number of ASCII values equals the number of four-bit chunks into which the scalar value is divided. If the scalar value's bit length is not a multiple of four, the most significant chunk is padded with leading zero bits. Each chunk therefore has a value between 0x0 and 0xF.
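As an illustration of the chunking step, the following Python sketch splits a scalar value into four-bit chunks, most significant first. The function name split_chunks is hypothetical. The minimum of two chunks is taken from the table above, where every indirectly encoded value up to U+FF uses two ASCII values.

```python
def split_chunks(code_point: int) -> list[int]:
    """Split a scalar value into four-bit chunks, most significant first."""
    # Round the bit length up to a whole number of chunks; per the table,
    # an indirectly encoded value never uses fewer than two chunks.
    count = max(2, (code_point.bit_length() + 3) // 4)
    # Shifting and masking pads the top chunk with leading zero bits.
    return [(code_point >> (4 * i)) & 0xF for i in reversed(range(count))]
```

For example, U+2C (comma) yields the chunks 0x2 and 0xC, and U+1F466 yields 0x1, 0xF, 0x4, 0x6, 0x6.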

Initial Chunks

Start with the most significant chunk and work through every chunk except the last. For each one, output a single ASCII value in place of its four-bit value according to the table below:

Chunk Value ASCII Value
0x0 0x21
0x1 0x22
0x2 0x23
0x3 0x24
0x4 0x25
0x5 0x26
0x6 0x27
0x7 0x28
0x8 0x29
0x9 0x2A
0xA 0x2B
0xB 0x2C
0xC 0x2D
0xD 0x2E
0xE 0x2F
0xF 0x3A

Closing Chunk

For a Unicode value's final chunk, use the table below instead:

Chunk Value ASCII Value
0x0 0x3B
0x1 0x3C
0x2 0x3D
0x3 0x3E
0x4 0x3F
0x5 0x40
0x6 0x5B
0x7 0x5C
0x8 0x5D
0x9 0x5E
0xA 0x5F
0xB 0x60
0xC 0x7B
0xD 0x7C
0xE 0x7D
0xF 0x7E

After this, the next Unicode value can be encoded.
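Putting the encoding steps together, a complete encoder might look like the Python sketch below. The names INITIAL, CLOSING, is_direct and encode are illustrative. The two strings list the ASCII characters from the initial and closing tables in chunk-value order. The sketch assumes the input contains only Unicode scalar values (no lone surrogates).

```python
# ASCII characters for chunk values 0x0..0xF, from the two tables.
INITIAL = "!\"#$%&'()*+,-./:"   # 0x21..0x2F, then 0x3A
CLOSING = ";<=>?@[\\]^_`{|}~"   # 0x3B..0x40, 0x5B..0x60, 0x7B..0x7E


def is_direct(cp: int) -> bool:
    """True for scalar values that are encoded as themselves."""
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def encode(text: str) -> str:
    """Encode text as STF-7 (assumes text holds only scalar values)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if is_direct(cp):
            out.append(ch)
            continue
        # Split into four-bit chunks, most significant first (minimum two).
        count = max(2, (cp.bit_length() + 3) // 4)
        chunks = [(cp >> (4 * i)) & 0xF for i in reversed(range(count))]
        # Every chunk except the last uses the initial table.
        out.extend(INITIAL[c] for c in chunks[:-1])
        # The last chunk uses the closing table.
        out.append(CLOSING[chunks[-1]])
    return "".join(out)
```

With this sketch, encode("Hello, world!") returns Hello#{ world#<, which agrees with the corresponding row in the Samples section.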

Decoding

Decoding performs the encoding steps in reverse.

Decoders should enforce these two specific rules:

  1. Any character that can be directly encoded should be directly encoded.
  2. Indirectly encoded characters should be encoded using as few ASCII values as possible.

Any decoded Unicode value that doesn't follow these rules should be rejected.
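A strict decoder that enforces both rules could be sketched in Python as follows. All names are illustrative. Raising ValueError is an assumption about how rejection is reported, as is the handling of truncated sequences, non-ASCII input and surrogates, none of which the text above specifies.

```python
INITIAL = "!\"#$%&'()*+,-./:"   # initial-chunk characters for values 0x0..0xF
CLOSING = ";<=>?@[\\]^_`{|}~"   # closing-chunk characters for values 0x0..0xF


def is_direct(cp: int) -> bool:
    return (cp <= 0x20 or 0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A
            or 0x61 <= cp <= 0x7A or cp == 0x7F)


def decode(data: str) -> str:
    """Decode STF-7 text, rejecting anything a strict encoder would not emit."""
    out = []
    chunks = []  # chunk values of the indirect sequence currently being read
    for ch in data:
        if ch in INITIAL:
            chunks.append(INITIAL.index(ch))
        elif ch in CLOSING:
            chunks.append(CLOSING.index(ch))
            cp = 0
            for chunk in chunks:
                cp = (cp << 4) | chunk
            shortest = max(2, (cp.bit_length() + 3) // 4)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                # Assumption: only Unicode scalar values are accepted.
                raise ValueError("not a Unicode scalar value")
            if is_direct(cp):
                raise ValueError("rule 1: value must be directly encoded")
            if len(chunks) != shortest:
                raise ValueError("rule 2: sequence is longer than necessary")
            out.append(chr(cp))
            chunks = []
        elif is_direct(ord(ch)) and not chunks:
            out.append(ch)
        else:
            # Assumption: non-ASCII input or a direct value inside an
            # unfinished indirect sequence is treated as malformed.
            raise ValueError("malformed input")
    if chunks:
        raise ValueError("input ends inside an indirect sequence")
    return "".join(out)
```

Under these assumptions, the sequence %< (chunks 0x4 and 0x1, or U+41) is rejected by rule 1 because "A" is directly encodable, and !#{ (chunks 0x0, 0x2, 0xC) is rejected by rule 2 because U+2C needs only two chunks.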

After seeking to a random position, resume decoding immediately only if the value at that position is directly encoded. Otherwise, skip past the next closing chunk and resume decoding there.
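The resynchronization rule can be sketched in Python as shown below. The function name resync is hypothetical. The sketch assumes that when the seek position itself holds a closing chunk, that chunk counts as the "next" one, so decoding resumes right after it.

```python
INITIAL = "!\"#$%&'()*+,-./:"   # initial-chunk characters
CLOSING = ";<=>?@[\\]^_`{|}~"   # closing-chunk characters


def resync(data: str, pos: int) -> int:
    """Return the index at which decoding can safely resume after a seek."""
    if pos >= len(data):
        return len(data)
    if data[pos] not in INITIAL and data[pos] not in CLOSING:
        # A directly encoded value: decoding can resume right here.
        return pos
    # Inside (or at the start of) an indirect sequence: skip past the
    # next closing chunk, searching from the seek position itself.
    for i in range(pos, len(data)):
        if data[i] in CLOSING:
            return i + 1
    return len(data)
```

For the string Hello#{ world#<, seeking to index 5 or 6 (inside the encoded comma) resumes at index 7, the space that follows it.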

Samples

Text                                          STF-7
the quick brown fox jumps over the lazy dogs  the quick brown fox jumps over the lazy dogs
Hello, world!                                 Hello#{ world#<
Emoji: 👦                                     Emoji$_ ":%'[
Ethnically-modified emoji: 👦🏿                Ethnically#|modified emoji$_ ":%'[":$:~
안녕하세요!                                   -&%],"&@.&&]-"$]-'*?#<
제가 美國人입니다: 🇺🇸🧑                       -)"{+-!; (:)}&(!`%/,_-()@,#-],#/?$_ ":":_":":]":*.<
私の酒です。                                  (*-<$!'}*"&=$!'\$!&^$!!=

Byte Order Mark

STF-7's byte order mark is U+FEFF, indirectly encoded as: 0x3A 0x2F 0x3A 0x7E
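As a worked check, the Python sketch below maps the four byte order mark values back through the initial and closing tables and confirms that they form the chunks 0xF, 0xE, 0xF, 0xF, which is U+FEFF. The helper strip_bom is hypothetical; whether a decoder should drop the mark is an assumption, not something STF-7 states.

```python
INITIAL = "!\"#$%&'()*+,-./:"   # initial-chunk characters for values 0x0..0xF
CLOSING = ";<=>?@[\\]^_`{|}~"   # closing-chunk characters for values 0x0..0xF

BOM = bytes([0x3A, 0x2F, 0x3A, 0x7E])  # the ASCII characters ":/:~"

# Three initial chunks followed by one closing chunk.
bom_chunks = [INITIAL.index(chr(b)) for b in BOM[:-1]]
bom_chunks.append(CLOSING.index(chr(BOM[-1])))

bom_value = 0
for chunk in bom_chunks:
    bom_value = (bom_value << 4) | chunk


def strip_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading STF-7 byte order mark, if any."""
    return data[len(BOM):] if data.startswith(BOM) else data
```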