stuff i've learned through creating Shodouka

Japanese text encoding

Japanese character sets

There are a number of Japanese character set standards, all of which are identified by a code starting with "JIS", which stands for "Japanese Industrial Standard". The most widely used Japanese character set is known as JIS X 0208-1990. It includes 6879 characters, among which are the hiragana and katakana syllabaries, 6355 kanji, the Roman, Greek, and Cyrillic alphabets, the numerals, and a number of typographic symbols. The characters are arranged in a 94-by-94 grid, which usually becomes a row number from 33 to 126 and a column number from 33 to 126. In common usage, "JIS" when not followed by a particular standard number refers to the JIS X 0208-1990 character set.

Japanese transfer encodings

With JIS X 0208-1990, there are many more distinct characters than can possibly fit in a single byte. The solution is to use an encoding scheme that sends each value as two bytes. Because a lot of communication on the Internet still takes place in ASCII, it is also desirable to encode JIS in such a way that it can be distinguished from ASCII. There are a few different ways to do this.

Please note that while there are three encoding schemes, all of them encode (effectively) the same character set. Be sure you understand the difference between a character set and an encoding scheme before you go on.
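To make the distinction concrete, here is a small sketch that encodes one character with each of the three schemes, using the codecs that ship with Python (the codec names "iso2022_jp", "euc_jp", and "shift_jis" are Python's, not part of the original discussion). The byte strings differ, but each decodes back to the same character.

```python
# One character from the JIS X 0208 set: hiragana "a" (U+3042),
# whose JIS value is $2422.
text = "\u3042"

# Encode it with each of the three schemes.
encoded = {name: text.encode(name)
           for name in ("iso2022_jp", "euc_jp", "shift_jis")}

for name, data in encoded.items():
    print(name.ljust(11), " ".join(f"{b:02x}" for b in data))

# Decoding each byte string with its own codec recovers the same text.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

The ISO-2022-JP result is the longest because it carries the escape sequences around the two data bytes.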

ISO-2022-JP (JIS) encoding

ISO-2022 defines a standard way to send data in multiple character sets when the transmission medium supports 7-bit bytes. This is done by including "escape sequences" in the text; that is, special codes that indicate a switch between character sets. Each escape sequence begins (take a wild guess!) with the "escape" character ($1b). There are many registered escape sequences for different character sets and languages; ISO-2022-JP recognizes a subset of these escape sequences relevant to Japanese.

sequence      hex values       effect   

Esc ( B       $1b $28 $42      switch to ASCII
Esc ( J       $1b $28 $4a      switch to JIS Roman (JIS X 0201-1976)

JIS Roman runs from 0 to $7f and is identical to ASCII except for a few minor differences (notably, the backslash at 92 is instead a yen symbol, and the tilde at 126 is replaced by an overbar). For most practical purposes, JIS Roman and ASCII can be considered the same, so both these escape sequences can be treated as a switch to ASCII.

sequence      hex values       effect   

Esc $ @       $1b $24 $40      switch to JIS C 6226-1978
Esc $ B       $1b $24 $42      switch to JIS X 0208-1983

Both JIS C 6226-1978 and JIS X 0208-1983 are earlier versions of JIS X 0208-1990. For most practical purposes, both these escape sequences can be treated as a switch to JIS X 0208-1990.

Typically, then, Japanese text appears enclosed by two escape sequences: either Esc $ @ or Esc $ B at the beginning, and either Esc ( B or Esc ( J at the end. The text itself between the escape sequences consists of pairs of plain 7-bit bytes in the printable range from $21 to $7e, simply formed by splitting apart the JIS value into two bytes, also known as "raw JIS". Because the data itself matches the original JIS character numbers, the ISO-2022-JP encoding method is also known as "JIS encoding" (not to be confused with the "JIS character set"!). The figure shows the encoding range for this method, with each pixel corresponding to one possible combination of first byte (j) and second byte (k). The pixel colours describe conversion to another system; read on.
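As a rough illustration of how a reader of ISO-2022-JP text might follow the escape sequences, here is a minimal sketch. The function name split_iso2022jp and the "ascii"/"jis" labels are made up for this example, and it handles only the four escape sequences listed in the tables above.

```python
ESC = 0x1B

# The four escape sequences from the tables, mapped to the mode they select.
MODES = {
    b"\x1b(B": "ascii",  # Esc ( B
    b"\x1b(J": "ascii",  # Esc ( J  (JIS Roman, treated as ASCII here)
    b"\x1b$@": "jis",    # Esc $ @  (JIS C 6226-1978)
    b"\x1b$B": "jis",    # Esc $ B  (JIS X 0208-1983)
}


def split_iso2022jp(data):
    """Return a list of ("ascii", byte) and ("jis", (j, k)) items."""
    items = []
    mode = "ascii"  # assume ASCII until an escape sequence says otherwise
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i:i + 3])
            if seq not in MODES:
                raise ValueError(f"unrecognized escape sequence {seq!r}")
            mode = MODES[seq]
            i += 3
        elif mode == "jis":
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            # Two plain 7-bit bytes form one "raw JIS" value.
            items.append(("jis", (data[i], data[i + 1])))
            i += 2
        else:
            items.append(("ascii", data[i]))
            i += 1
    return items


# "A", then the JIS value $2422 between escape sequences, then "Z".
sample = b"A\x1b$B$\"\x1b(BZ"
print(split_iso2022jp(sample))
```

Note that the two data bytes in the sample ($24 and $22) are themselves printable ASCII characters; only the escape sequences tell the reader to pair them up.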

EUC-JP encoding

EUC, or Extended Unix Code, takes advantage of media that support 8-bit bytes. It's a very simple and straightforward solution: to distinguish Japanese characters from ASCII, simply add 128 to each JIS value by setting the highest bit of each byte.

If j and k are the original JIS values and e and f are the transmitted EUC bytes, then

e = j + 128

f = k + 128

This pushes all the EUC codes up into the top half of the 8-bit range. They land from $a1 to $fe, where they have no chance of getting confused with ASCII codes from 0 to $7f. Nice and easy.
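The two formulas translate almost directly into code. This is a small sketch; the function names jis_to_euc and euc_to_jis are invented for the example.

```python
def jis_to_euc(j, k):
    """Convert a JIS byte pair (j, k) to EUC bytes (e, f)."""
    if not (0x21 <= j <= 0x7E and 0x21 <= k <= 0x7E):
        raise ValueError("JIS bytes must be in the range $21 to $7e")
    # Setting the high bit is the same as adding 128.
    return j | 0x80, k | 0x80


def euc_to_jis(e, f):
    """Convert EUC bytes (e, f) back to the JIS byte pair (j, k)."""
    if not (0xA1 <= e <= 0xFE and 0xA1 <= f <= 0xFE):
        raise ValueError("EUC bytes must be in the range $a1 to $fe")
    # Clearing the high bit subtracts the 128 again.
    return e & 0x7F, f & 0x7F


# JIS $2422 (hiragana "a") becomes $a4 $a2 in EUC.
print([hex(b) for b in jis_to_euc(0x24, 0x22)])
```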

Shift-JIS encoding

This encoding method is easily the messiest of the three. It's also known as SJIS or MS Kanji. It was dreamed up by some folks at Microsoft for the Japanese versions of their operating systems and software, and it's very ugly. This method also requires an 8-bit medium, but it doesn't behave nicely by keeping everything above the 128 mark. Instead, you are only guaranteed that the first byte of each pair is above 128; all bets are off for the second.

The JIS values get all rearranged in order to reserve the range $a1 to $df for a set of 63 half-width katakana; to accomplish this, the characters are squashed into half as many columns (values for the first byte) but twice as many rows (values for the second byte). As it turns out, these half-width katakana are rarely used anyway.

The figure shows the encoding ranges for shift-JIS: the first byte will land either from $81 to $9f or from $e0 to $ef, and the second byte will land either from $40 to $7e or from $80 to $fc. Each shift-JIS first byte covers two consecutive JIS first-byte values: the odd one fills second bytes $40 to $7e and $80 to $9e, and the even one fills $9f to $fc. The 94 JIS first-byte values pair off evenly into the 47 available shift-JIS first-byte values.

The colours of the pixels in these three maps illustrate how to perform the necessary contortions to convert to or from shift-JIS. If you look closely at the maps for ISO-2022-JP and EUC, you'll see that the squares (apart from being split between red and blue at 95 and 223) actually have alternating dark and light columns. Each pair gets joined into one long column for shift-JIS, as the colours in this map demonstrate. If s and t are the transmitted shift-JIS bytes, then (with the division by 2 rounded down)

when j is from 33 to 94 (red/orange), s = (j+1)/2 + 112
when j is from 95 to 126 (blue/turquoise), s = (j+1)/2 + 176
when j is odd (red/blue), t = k + 31, plus one more if k > 95
when j is even (orange/turquoise), t = k + 126

Whew.
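For anyone who would rather read code than stare at coloured pixels, here is a sketch of the conversion in both directions. The function names are invented for this example. It splits the first byte between j = 94 and j = 95, which is the split that puts s in $81 to $9f and $e0 to $ef, and it uses integer division for (j+1)/2.

```python
def jis_to_sjis(j, k):
    """Convert a JIS byte pair (j, k) to shift-JIS bytes (s, t)."""
    if not (33 <= j <= 126 and 33 <= k <= 126):
        raise ValueError("JIS bytes must be in the range 33 to 126")
    # Two consecutive JIS first bytes share one shift-JIS first byte.
    s = (j + 1) // 2 + (112 if j <= 94 else 176)
    if j % 2 == 1:
        # Odd j: lower part of the second-byte range, skipping $7f.
        t = k + 31 + (1 if k > 95 else 0)
    else:
        # Even j: upper part of the second-byte range, $9f to $fc.
        t = k + 126
    return s, t


def sjis_to_jis(s, t):
    """Convert shift-JIS bytes (s, t) back to the JIS byte pair (j, k)."""
    if not (0x81 <= s <= 0x9F or 0xE0 <= s <= 0xEF):
        raise ValueError("first byte out of range")
    if not (0x40 <= t <= 0xFC) or t == 0x7F:
        raise ValueError("second byte out of range")
    pair = s - (112 if s <= 0x9F else 176)  # equals (j + 1) // 2
    if t >= 0x9F:
        return pair * 2, t - 126            # even j
    return pair * 2 - 1, t - 31 - (1 if t > 0x7F else 0)  # odd j


# JIS $2422 (hiragana "a") becomes $82 $a0 in shift-JIS.
print([hex(b) for b in jis_to_sjis(0x24, 0x22)])
```

Going through EUC is then just a matter of adding or removing 128 on the JIS side before or after calling these functions.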

All Together Now

So what happens when you receive some arbitrary document and you've got to figure out how to interpret it? Have a look at the page about working with three encoding schemes on the WWW.