Knowing that:
I have taken it upon myself to create a 7-bit encoding scheme for Unicode.
This encoding has a number of advantages over UTF-7:
`U+10FFFF` encodes in six ASCII values.
Note that STF-7 still shares some of UTF-7's security concerns. See the dedicated section below for more information.
Some Unicode characters are encoded directly as single ASCII values. Others are encoded indirectly as multiple values, up to and including six.
The following character ranges are encoded directly by making the encoded ASCII value the same as the Unicode scalar value.
Low | High | Description |
---|---|---|
U+00 | U+20 | Control codes and space |
U+30 | U+39 | Numbers |
U+41 | U+5A | Latin uppercase |
U+61 | U+7A | Latin lowercase |
U+7F | U+7F | Delete |
Directly encoded characters can be interpreted as-is by an ASCII system without STF-7 support.
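As a minimal sketch of the direct ranges in the table above, the check below reports whether a scalar value passes through unchanged. The names `DIRECT_RANGES` and `is_direct` are illustrative and not part of STF-7 itself.

```python
# Inclusive ranges of scalar values that are encoded directly (see the table above).
DIRECT_RANGES = [
    (0x00, 0x20),  # control codes and space
    (0x30, 0x39),  # numbers
    (0x41, 0x5A),  # Latin uppercase
    (0x61, 0x7A),  # Latin lowercase
    (0x7F, 0x7F),  # delete
]


def is_direct(scalar: int) -> bool:
    """Return True if the scalar value is encoded as one identical ASCII value."""
    return any(low <= scalar <= high for low, high in DIRECT_RANGES)


print(is_direct(ord("a")), is_direct(ord("!")))
```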
Any Unicode character not directly encoded is indirectly encoded. In this case, the scalar value is split into chunks of four bits.
Low | High | Chunks |
---|---|---|
U+21 | U+2F | 2 |
U+3A | U+40 | 2 |
U+5B | U+60 | 2 |
U+7B | U+7E | 2 |
U+80 | U+FF | 2 |
U+100 | U+FFF | 3 |
U+1000 | U+FFFF | 4 |
U+10000 | U+FFFFF | 5 |
U+100000 | U+10FFFF | 6 |
The number of ASCII values needed equals the number of four-bit chunks into which the scalar value is divided. Padding is done by prepending any necessary zero bits to the most significant chunk. Each chunk therefore has a value between `0x0` and `0xF`.
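The chunking step might be sketched as follows, assuming the input has already been identified as an indirectly encoded scalar value. The function name `split_chunks` is hypothetical.

```python
def split_chunks(scalar: int) -> list[int]:
    """Split an indirectly encoded scalar value into four-bit chunks, most significant first."""
    # Chunk counts follow the table above: 2 up to U+FF, then 3, 4, 5, and 6.
    if scalar <= 0xFF:
        count = 2
    elif scalar <= 0xFFF:
        count = 3
    elif scalar <= 0xFFFF:
        count = 4
    elif scalar <= 0xFFFFF:
        count = 5
    else:
        count = 6
    # Shifting from the top down leaves zero bits as padding in the first chunk.
    return [(scalar >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]


print([hex(c) for c in split_chunks(0x2C)])     # comma, U+2C
print([hex(c) for c in split_chunks(0x1F466)])  # U+1F466
```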
Start with the most significant chunk and work your way through each one except for the last one. Output a single ASCII value for each one in place of its four bit value according to the table below:
Chunk Value | ASCII Value |
---|---|
0x0 | 0x21 |
0x1 | 0x22 |
0x2 | 0x23 |
0x3 | 0x24 |
0x4 | 0x25 |
0x5 | 0x26 |
0x6 | 0x27 |
0x7 | 0x28 |
0x8 | 0x29 |
0x9 | 0x2A |
0xA | 0x2B |
0xB | 0x2C |
0xC | 0x2D |
0xD | 0x2E |
0xE | 0x2F |
0xF | 0x3A |
For a Unicode value's final chunk, use the table below instead:
Chunk Value | ASCII Value |
---|---|
0x0 | 0x3B |
0x1 | 0x3C |
0x2 | 0x3D |
0x3 | 0x3E |
0x4 | 0x3F |
0x5 | 0x40 |
0x6 | 0x5B |
0x7 | 0x5C |
0x8 | 0x5D |
0x9 | 0x5E |
0xA | 0x5F |
0xB | 0x60 |
0xC | 0x7B |
0xD | 0x7C |
0xE | 0x7D |
0xF | 0x7E |
After this, the next Unicode value can be encoded.
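Putting the steps together, a rough encoder sketch could look like the one below. The names `LEAD`, `FINAL`, `is_direct`, and `encode` are assumptions made for illustration, and the sketch does not validate its input (for example, it does not reject lone surrogates in a Python string).

```python
# ASCII values for every chunk except the last: 0x21-0x2F, then 0x3A.
LEAD = "".join(map(chr, [*range(0x21, 0x30), 0x3A]))
# ASCII values for the final chunk: 0x3B-0x40, 0x5B-0x60, 0x7B-0x7E.
FINAL = "".join(map(chr, [*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)]))


def is_direct(scalar: int) -> bool:
    """True for control codes, space, digits, Latin letters, and delete."""
    return (
        scalar <= 0x20
        or 0x30 <= scalar <= 0x39
        or 0x41 <= scalar <= 0x5A
        or 0x61 <= scalar <= 0x7A
        or scalar == 0x7F
    )


def encode(text: str) -> str:
    out = []
    for ch in text:
        scalar = ord(ch)
        if is_direct(scalar):
            out.append(ch)
            continue
        # Two chunks at minimum, otherwise one chunk per four significant bits.
        count = max(2, (scalar.bit_length() + 3) // 4)
        chunks = [(scalar >> (4 * i)) & 0xF for i in range(count - 1, -1, -1)]
        out.extend(LEAD[c] for c in chunks[:-1])
        out.append(FINAL[chunks[-1]])
    return "".join(out)


print(encode("Hello, world!"))
```

The printed result can be compared against the corresponding row of the sample table in this document.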
Decoding performs the encoding steps in reverse.
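A decoder sketch under the same table layout is shown below. The names `LEAD`, `FINAL`, and `decode` are hypothetical, and the sketch performs only basic structural checks; it is not a complete validating decoder.

```python
# Same lookup strings as the two chunk tables: index = chunk value.
LEAD = "".join(map(chr, [*range(0x21, 0x30), 0x3A]))
FINAL = "".join(map(chr, [*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)]))


def decode(data: str) -> str:
    out = []
    value = 0   # bits accumulated from leading chunks
    chunks = 0  # number of leading chunks seen for the current value
    for ch in data:
        if ord(ch) > 0x7F:
            raise ValueError("input is not 7-bit ASCII")
        if ch in LEAD:
            chunks += 1
            if chunks > 5:
                raise ValueError("more than six chunks in one value")
            value = (value << 4) | LEAD.index(ch)
        elif ch in FINAL:
            # chr() raises ValueError for anything above U+10FFFF.
            out.append(chr((value << 4) | FINAL.index(ch)))
            value, chunks = 0, 0
        else:
            if chunks:
                raise ValueError("directly encoded value inside a chunk sequence")
            out.append(ch)
    if chunks:
        raise ValueError("input ends before a final chunk")
    return "".join(out)


print(decode("Hello#{ world#<"))
```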
Decoders should enforce these two specific rules:
Any decoded Unicode value that doesn't follow these rules should be rejected.
When randomly seeking, do not immediately resume decoding unless a directly encoded value is detected. Seek immediately past the next closing chunk instead, and resume decoding there.
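The seeking rule could be sketched as a small resynchronization helper. The function name `resync` and the choice to return an index into the encoded string are assumptions for this example.

```python
LEAD = "".join(map(chr, [*range(0x21, 0x30), 0x3A]))
FINAL = "".join(map(chr, [*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)]))


def resync(data: str, pos: int) -> int:
    """Return the index at which decoding can resume after seeking to pos."""
    while pos < len(data):
        ch = data[pos]
        if ch in FINAL:
            return pos + 1  # resume just past the closing chunk
        if ch not in LEAD:
            return pos      # directly encoded value: resume right here
        pos += 1            # inside a chunk sequence: keep scanning
    return pos


data = "a\":%'[b"            # "a", one five-chunk value, then "b"
print(resync(data, 0), resync(data, 2), resync(data, 5))
```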
STF-7's byte order mark is: `0x3A 0x2F 0x3A 0x7E`.
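As a quick check, the four values can be mapped back through the two chunk tables; the result is U+FEFF. The helper name `decode_sequence` is hypothetical.

```python
LEAD = "".join(map(chr, [*range(0x21, 0x30), 0x3A]))
FINAL = "".join(map(chr, [*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)]))

BOM = bytes([0x3A, 0x2F, 0x3A, 0x7E])


def decode_sequence(seq: bytes) -> int:
    """Decode one indirectly encoded value into its scalar value."""
    text = seq.decode("ascii")
    value = 0
    for ch in text[:-1]:
        value = (value << 4) | LEAD.index(ch)        # leading chunks
    return (value << 4) | FINAL.index(text[-1])      # final chunk


print(f"U+{decode_sequence(BOM):04X}")
```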
As STF-7 does not directly encode the entire ASCII range, certain single ASCII values are represented using pairs of ASCII values in the encoded output.
Consider, for instance, the hypothetical case of filtering JavaScript out of HTML by searching the encoded text for occurrences of `<script>`. This typically works reliably under most code pages because the basic ASCII range (`0x00`-`0x7F`) is encoded as-is. With STF-7, however, `<script>` is encoded as `${script$}`, which evades the filter. The encoded text can still be decoded later in the process and executed with every script block left intact.
Therefore, STF-7 should only be implemented as a part of a tightly-controlled system. Encoding should happen immediately before entering a 7-bit transmission medium and decoding should happen immediately after exiting this medium.
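The filtering problem can be illustrated with a deliberately naive check. Both `naive_filter_passes` and the stripped-down `decode` are hypothetical stand-ins, not a real sanitizer or a validating decoder.

```python
LEAD = "".join(map(chr, [*range(0x21, 0x30), 0x3A]))
FINAL = "".join(map(chr, [*range(0x3B, 0x41), *range(0x5B, 0x61), *range(0x7B, 0x7F)]))


def decode(data: str) -> str:
    """Minimal decoder with no validation, for demonstration only."""
    out, value = [], 0
    for ch in data:
        if ch in LEAD:
            value = (value << 4) | LEAD.index(ch)
        elif ch in FINAL:
            out.append(chr((value << 4) | FINAL.index(ch)))
            value = 0
        else:
            out.append(ch)
    return "".join(out)


def naive_filter_passes(text: str) -> bool:
    """Hypothetical filter that only looks for the literal ASCII tag."""
    return "<script>" not in text


encoded = "${script$}"
print(naive_filter_passes(encoded))          # the filter sees nothing suspicious
print(naive_filter_passes(decode(encoded)))  # the tag reappears after decoding
```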
Below are some samples of how STF-7 encodes different characters.
Text | STF-7 |
---|---|
the quick brown fox jumps over the lazy dogs | the quick brown fox jumps over the lazy dogs |
Hello, world! | Hello#{ world#< |
Emoji: 👦 | Emoji$_ ":%'[ |
Ethnically-modified emoji: 👦🏿 | Ethnically#|modified emoji$_ ":%'[":$:~ |
Καλά, ευχαριστώ | $*_$,<$,`$+{#{ $,@$-@$-\$,<$-<$,^$->$-?$-} |
иметь много общего | %$]%${%$@%%=%%{ %${%$|%$}%$>%$} %$}%$<%%^%$@%$>%$} |
אני מדבר קצת עברית | &.;&/;&.^ &.}&.>&.<&/] &/\&/[&/_ &/=&.<&/]&.^&/_ |
נִבְזֶה וַחֲדַל אִישִׁים | &/;&,?&.<&,;&.[&,[&.? &.@&,\&.\&,=&.>&,\&.{ &.;&,?&.^:,#_&,?&.^&.| |
حفلات تخرّج | '#|'%<'%?'#\'#_ '#_'#}'$<'&<'#{ |
दोपहर के बाद नमस्कार | *#[*%`*#_*$^*$; *"@*%\ *#{*$}*#[ *#]*#}*$]*%|*"@*$}*$; |
私の酒です。 | (*-<$!'}*"&=$!'\$!&^$!!= |
안녕하세요! | -&%],"&@.&&]-"$]-'*?#< |
제가 美國人입니다: 🇺🇸🧑 | -)"{+-!; (:)}&(!`%/,_-()@,#-],#/?$_ ":":_":":]":*.< |
事实胜于雄辩。 | %/)`&,*})!.{%/)}*'-?):+^$!!= |
♔♕♖♗♘♙♚♛♜♝♞♟︎ | #'&?#'&@#'&[#'&\#'&]#'&^#'&_#'&`#'&{#'&|#'&}#'&~:/!} |
The table below compares how many bytes STF-7 and other encodings require for different scalar values. UTF-7 is not listed as it is rather freeform and arbitrary.
Low | High | STF-7 | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|---|---|
U+00 | U+7F | 1 or 2 | 1 | 2 | 4 |
U+80 | U+FF | 2 | 2 | 2 | 4 |
U+100 | U+7FF | 3 | 2 | 2 | 4 |
U+800 | U+FFF | 3 | 3 | 2 | 4 |
U+1000 | U+FFFF | 4 | 3 | 2 | 4 |
U+10000 | U+FFFFF | 5 | 4 | 4 | 4 |
U+100000 | U+10FFFF | 6 | 4 | 4 | 4 |
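The figures in the table can be spot-checked with a short script. `stf7_length` is a hypothetical helper derived from the rules in this document, while the other columns use Python's built-in codecs (the `-le` variants are chosen so that no byte order mark is counted).

```python
def stf7_length(scalar: int) -> int:
    """Number of ASCII values STF-7 needs for one scalar value."""
    direct = (
        scalar <= 0x20
        or 0x30 <= scalar <= 0x39
        or 0x41 <= scalar <= 0x5A
        or 0x61 <= scalar <= 0x7A
        or scalar == 0x7F
    )
    if direct:
        return 1
    # Indirect: at least two chunks, otherwise one per four significant bits.
    return max(2, (scalar.bit_length() + 3) // 4)


for scalar in (0x41, 0x21, 0xE9, 0x3A9, 0x20AC, 0x1F466, 0x10FFFF):
    ch = chr(scalar)
    print(
        f"U+{scalar:04X}",
        stf7_length(scalar),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
```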
To the extent possible under law, William Swartzendruber has waived all copyright and related or neighboring rights to STF-7: The Silly 7-bit Transformation Format.
This work is published from: United States of America