Knowing that:
I have taken it upon myself to create a 7-bit encoding scheme for Unicode.
This encoding has a number of advantages over UTF-7:
U+10FFFF encodes in six ASCII values.
Some Unicode characters are encoded directly as single ASCII values. Others are encoded indirectly as multiple values, up to and including six.
The following character ranges are encoded directly by making the encoded ASCII value the same as the Unicode scalar value.
Low | High | Description |
---|---|---|
U+00 | U+20 | Control codes and space |
U+30 | U+39 | Numbers |
U+41 | U+5A | Latin uppercase |
U+61 | U+7A | Latin lowercase |
U+7F | U+7F | Delete |
Directly encoded characters can be interpreted as-is by an ASCII system without STF-7 support.
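The direct ranges in the table above can be checked with a small predicate. This is a minimal sketch; the names `DIRECT_RANGES` and `is_direct` are illustrative and not part of STF-7 itself.

```python
# Inclusive (low, high) pairs copied from the table of directly encoded ranges.
DIRECT_RANGES = [
    (0x00, 0x20),  # control codes and space
    (0x30, 0x39),  # numbers
    (0x41, 0x5A),  # Latin uppercase
    (0x61, 0x7A),  # Latin lowercase
    (0x7F, 0x7F),  # delete
]


def is_direct(cp: int) -> bool:
    """Return True if the scalar value cp is encoded as a single identical ASCII value."""
    return any(lo <= cp <= hi for lo, hi in DIRECT_RANGES)


print(is_direct(ord("a")))  # True
print(is_direct(ord(",")))  # False
```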
Any Unicode character not directly encoded is indirectly encoded. In this case, the scalar value is split into chunks of four bits.
Low | High | ASCII Values |
---|---|---|
U+21 | U+2F | 2 |
U+3A | U+40 | 2 |
U+5B | U+60 | 2 |
U+7B | U+7E | 2 |
U+80 | U+FF | 2 |
U+100 | U+FFF | 3 |
U+1000 | U+FFFF | 4 |
U+10000 | U+FFFFF | 5 |
U+100000 | U+10FFFF | 6 |
The number of ASCII values needed equals the number of four-bit chunks the scalar value is divided into. Padding is done by prepending any necessary zeroes to the most significant chunk. Each chunk therefore has a value between 0x0 and 0xF.
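As a rough illustration of the chunking step, the sketch below splits a scalar value into four-bit chunks, most significant first. The function name `split_chunks` is hypothetical. The `max(2, ...)` guard is only defensive, since every indirectly encoded value is at least U+21 and already needs two chunks.

```python
def split_chunks(cp: int) -> list:
    """Split a scalar value into four-bit chunks, most significant chunk first."""
    # Round the bit length up to a whole number of four-bit chunks.
    count = max(2, (cp.bit_length() + 3) // 4)
    return [(cp >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]


print(split_chunks(0x2C))     # [2, 12]
print(split_chunks(0x1F466))  # [1, 15, 4, 6, 6]
```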
Start with the most significant chunk and work through every chunk except the last. For each one, output a single ASCII value in place of its four-bit value according to the table below:
Chunk Value | ASCII Value |
---|---|
0x0 | 0x21 |
0x1 | 0x22 |
0x2 | 0x23 |
0x3 | 0x24 |
0x4 | 0x25 |
0x5 | 0x26 |
0x6 | 0x27 |
0x7 | 0x28 |
0x8 | 0x29 |
0x9 | 0x2A |
0xA | 0x2B |
0xB | 0x2C |
0xC | 0x2D |
0xD | 0x2E |
0xE | 0x2F |
0xF | 0x3A |
For a Unicode value's final chunk, use the table below instead:
Chunk Value | ASCII Value |
---|---|
0x0 | 0x3B |
0x1 | 0x3C |
0x2 | 0x3D |
0x3 | 0x3E |
0x4 | 0x3F |
0x5 | 0x40 |
0x6 | 0x5B |
0x7 | 0x5C |
0x8 | 0x5D |
0x9 | 0x5E |
0xA | 0x5F |
0xB | 0x60 |
0xC | 0x7B |
0xD | 0x7C |
0xE | 0x7D |
0xF | 0x7E |
After this, the next Unicode value can be encoded.
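Putting the steps above together, an encoder might look like the following. This is a sketch rather than a reference implementation: `LEAD`, `FINAL`, `DIRECT`, and `stf7_encode` are illustrative names, the two lookup tables are built from the byte ranges in the tables above, and lone surrogates in the input are not handled.

```python
# ASCII values for every chunk except the last: 0x21..0x2F, then 0x3A.
LEAD = bytes(range(0x21, 0x30)) + b":"
# ASCII values for the final chunk: 0x3B..0x40, 0x5B..0x60, 0x7B..0x7E.
FINAL = bytes(range(0x3B, 0x41)) + bytes(range(0x5B, 0x61)) + bytes(range(0x7B, 0x7F))
# Scalar values that are encoded directly as themselves.
DIRECT = (set(range(0x00, 0x21)) | set(range(0x30, 0x3A))
          | set(range(0x41, 0x5B)) | set(range(0x61, 0x7B)) | {0x7F})


def stf7_encode(text: str) -> bytes:
    """Encode text following the direct and indirect rules described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in DIRECT:
            out.append(cp)
            continue
        count = max(2, (cp.bit_length() + 3) // 4)
        # Every chunk except the last uses the LEAD table.
        for i in range(count - 1, 0, -1):
            out.append(LEAD[(cp >> (4 * i)) & 0xF])
        # The least significant chunk uses the FINAL table.
        out.append(FINAL[cp & 0xF])
    return bytes(out)


print(stf7_encode("Hello, world!"))  # b'Hello#{ world#<'
```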
The decoding steps are performed in reverse.
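A decoder along these lines could accumulate chunks until it sees a value from the final-chunk table. The sketch below uses illustrative names (`LEAD_VAL`, `FINAL_VAL`, `stf7_decode`). Its error checks are structural assumptions only (truncated sequences, a final chunk with no leading chunk, more than six chunks, non-ASCII input). It does not claim to implement any further validation rules of STF-7.

```python
LEAD = bytes(range(0x21, 0x30)) + b":"
FINAL = bytes(range(0x3B, 0x41)) + bytes(range(0x5B, 0x61)) + bytes(range(0x7B, 0x7F))
LEAD_VAL = {b: i for i, b in enumerate(LEAD)}    # ASCII value -> chunk value
FINAL_VAL = {b: i for i, b in enumerate(FINAL)}  # ASCII value -> chunk value


def stf7_decode(data: bytes) -> str:
    """Decode bytes by reversing the chunk mapping; raises ValueError on malformed input."""
    out = []
    acc, count = 0, 0  # accumulated value and number of leading chunks seen
    for b in data:
        if b in LEAD_VAL:
            if count == 5:
                raise ValueError("more than six chunks in one sequence")
            acc = (acc << 4) | LEAD_VAL[b]
            count += 1
        elif b in FINAL_VAL:
            if count == 0:
                raise ValueError("final chunk without a leading chunk")
            out.append(chr((acc << 4) | FINAL_VAL[b]))  # chr rejects values above U+10FFFF
            acc, count = 0, 0
        else:
            if count or b > 0x7F:
                raise ValueError("unexpected value 0x%02X" % b)
            out.append(chr(b))  # directly encoded value
    if count:
        raise ValueError("truncated sequence")
    return "".join(out)


print(stf7_decode(b"Hello#{ world#<"))  # Hello, world!
```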
Decoders should enforce these two specific rules:
Any decoded Unicode value that doesn't follow these rules should be rejected.
When randomly seeking, do not immediately resume decoding unless a directly encoded value is detected. Seek immediately past the next closing chunk instead, and resume decoding there.
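One way to express this seeking rule is a helper that returns the next safe offset. This is a hedged sketch: `resync` is a hypothetical name, and returning early when a directly encoded value appears inside an unfinished sequence is an assumption about how to treat malformed input.

```python
LEAD_BYTES = set(bytes(range(0x21, 0x30)) + b":")
FINAL_BYTES = set(bytes(range(0x3B, 0x41)) + bytes(range(0x5B, 0x61))
                  + bytes(range(0x7B, 0x7F)))


def resync(data: bytes, pos: int) -> int:
    """Return the first offset at or after pos where decoding can resume."""
    i = pos
    while i < len(data):
        b = data[i]
        if b in FINAL_BYTES:
            return i + 1  # just past the closing chunk
        if b not in LEAD_BYTES:
            return i      # directly encoded value: safe to resume here
        i += 1            # leading chunk: keep scanning for the closing chunk
    return len(data)


data = b"Hello#{ world#<"
print(resync(data, 0))  # 0
print(resync(data, 5))  # 7
```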
STF-7's byte order mark is: 0x3A 0x2F 0x3A 0x7E.
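Running those four values back through the chunk tables gives U+FEFF, the usual Unicode byte order mark. The sketch below checks that and shows one possible way to strip the mark; `BOM`, `bom_value`, and `strip_bom` are illustrative names.

```python
LEAD = bytes(range(0x21, 0x30)) + b":"
FINAL = bytes(range(0x3B, 0x41)) + bytes(range(0x5B, 0x61)) + bytes(range(0x7B, 0x7F))
BOM = bytes([0x3A, 0x2F, 0x3A, 0x7E])

# Rebuild the scalar value: three leading chunks, then the final chunk.
bom_value = 0
for b in BOM[:-1]:
    bom_value = (bom_value << 4) | LEAD.index(b)
bom_value = (bom_value << 4) | FINAL.index(BOM[-1])
print(hex(bom_value))  # 0xfeff


def strip_bom(data: bytes) -> bytes:
    """Drop a leading byte order mark if one is present."""
    return data[len(BOM):] if data.startswith(BOM) else data
```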
Below are some samples of how STF-7 encodes different characters.
Text | STF-7 |
---|---|
the quick brown fox jumps over the lazy dogs | the quick brown fox jumps over the lazy dogs |
Hello, world! | Hello#{ world#< |
Emoji: 👦 | Emoji$_ ":%'[ |
Ethnically-modified emoji: 👦🏿 | Ethnically#|modified emoji$_ ":%'[":$:~ |
Καλά, ευχαριστώ | $*_$,<$,`$+{#{ $,@$-@$-\$,<$-<$,^$->$-?$-} |
иметь много общего | %$]%${%$@%%=%%{ %${%$|%$}%$>%$} %$}%$<%%^%$@%$>%$} |
אני מדבר קצת עברית | &.;&/;&.^ &.}&.>&.<&/] &/\&/[&/_ &/=&.<&/]&.^&/_ |
حفلات تخرّج | '#|'%<'%?'#\'#_ '#_'#}'$<'&<'#{ |
दोपहर के बाद नमस्कार | *#[*%`*#_*$^*$; *"@*%\ *#{*$}*#[ *#]*#}*$]*%|*"@*$}*$; |
私の酒です。 | (*-<$!'}*"&=$!'\$!&^$!!= |
안녕하세요! | -&%],"&@.&&]-"$]-'*?#< |
제가 美國人입니다: 🇺🇸🧑 | -)"{+-!; (:)}&(!`%/,_-()@,#-],#/?$_ ":":_":":]":*.< |
事实胜于雄辩。 | %/)`&,*})!.{%/)}*'-?):+^$!!= |
The table below compares how many bytes STF-7 requires for different scalar values with the bytes required by other encodings. UTF-7 is not listed as it is rather freeform and arbitrary.
Low | High | STF-7 | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|---|---|
U+00 | U+7F | 1 or 2 | 1 | 2 | 4 |
U+80 | U+FF | 2 | 2 | 2 | 4 |
U+100 | U+7FF | 3 | 2 | 2 | 4 |
U+800 | U+FFF | 3 | 3 | 2 | 4 |
U+1000 | U+FFFF | 4 | 3 | 2 | 4 |
U+10000 | U+FFFFF | 5 | 4 | 4 | 4 |
U+100000 | U+10FFFF | 6 | 4 | 4 | 4 |
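The figures in this table can be spot-checked with Python's built-in codecs for UTF-8, UTF-16, and UTF-32. In this sketch `stf7_len` is a hypothetical helper that applies the direct ranges and chunk counts given earlier, and the sample code points are arbitrary picks from each row.

```python
DIRECT = (set(range(0x00, 0x21)) | set(range(0x30, 0x3A))
          | set(range(0x41, 0x5B)) | set(range(0x61, 0x7B)) | {0x7F})


def stf7_len(cp: int) -> int:
    """Number of ASCII values STF-7 needs for one scalar value."""
    if cp in DIRECT:
        return 1
    return max(2, (cp.bit_length() + 3) // 4)


print("scalar    STF-7  UTF-8  UTF-16  UTF-32")
for cp in (0x41, 0x2C, 0xE9, 0x3B1, 0x4E8B, 0x1F466, 0x10FFFF):
    ch = chr(cp)
    print("U+%06X  %5d  %5d  %6d  %6d" % (
        cp,
        stf7_len(cp),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # -le variants avoid counting a BOM
        len(ch.encode("utf-32-le")),
    ))
```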