stuff i've learned through creating Shodouka

Japanese encoding schemes and the WWW

This document describes how to work with the three encoding schemes for Japanese text, and makes an opinionated plug for the best scheme for the WWW, based on the conflicts between these schemes and HTML.

Japanese transfer encodings

The three most popular encoding schemes for Japanese text are ISO-2022-JP, EUC-JP, and Shift-JIS.

All Together Now

What happens when you receive a document with an arbitrary encoding scheme and you need to figure out how to interpret it? Let's combine the maps for all three encoding systems, together with ASCII (which we'll represent as a "first byte"), and see what we get.

The ASCII region shown on the map corresponds to the subset of ASCII understood by HTML: all of the printable characters from 33 to 126, tab (9), linefeed (10), carriage-return (13), and space (32).

Those ugly black lines running through the map represent three special characters in HTML: the ampersand ("&") and angle brackets ("<" and ">"). They should be ugly, because if your text inadvertently uses these characters, it will confuse browsers into looking for markup tags, which can make your documents look very ugly.

Notice how the intersecting regions show that a few possible conflicts exist.

Your only choice is ISO-2022 encoding if you are limited to 7-bit transmission. If you wish to display characters in languages other than Japanese as well, you'll also have to use the ISO-2022 way of switching character sets. But as the figure shows, ISO-2022-encoded Japanese is likely to contain any or all of the three magic markup characters. Of course, these special characters should be turned into the SGML entities &amp;, &lt;, and &gt; -- but nobody seems to do this. So any browser which does not make special provisions for these characters will eat HTML for dinner.
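To see the markup problem concretely, here is a minimal sketch in Python, assuming its standard iso2022_jp codec. The helper name markup_bytes is hypothetical, and the sample is simply a kanji picked because its two JIS bytes happen to be the same as "<" and ">".

```python
MARKUP = b"&<>"

def markup_bytes(text: str) -> list[int]:
    """Return the offsets of '&', '<' and '>' bytes in the ISO-2022-JP form of text."""
    encoded = text.encode("iso2022_jp")
    return [i for i, byte in enumerate(encoded) if byte in MARKUP]

# Build a one-character sample: ESC $ B switches to JIS, ESC ( B back to ASCII.
# The two bytes in between are 0x3C 0x3E, which a browser would read as "<>".
sample = b"\x1b$B<>\x1b(B".decode("iso2022_jp")

print(len(sample), "character;", "markup bytes at offsets", markup_bytes(sample))
```

A browser that scans the raw bytes for tags, without tracking the escape sequences, would treat those offsets as the start and end of a tag.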

Luckily, practically every HTTP (HyperText Transfer Protocol) server supports 8-bit clean transmissions, so you can use EUC. When text is encoded using EUC, it's easy to separate ASCII from JIS right away.
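The separation can be sketched in a few lines of Python: every byte of an EUC-encoded Japanese character has its high bit set, while ASCII bytes never do. This is only an illustration; split_euc is a made-up helper name, and the sample string is my own.

```python
def split_euc(data: bytes) -> list[tuple[str, bytes]]:
    """Split EUC-encoded bytes into runs labelled 'ascii' or 'jis'."""
    runs: list[tuple[str, bytes]] = []
    for byte in data:
        # High bit set means the byte belongs to a Japanese character.
        kind = "jis" if byte & 0x80 else "ascii"
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + bytes([byte]))
        else:
            runs.append((kind, bytes([byte])))
    return runs

mixed = "<b>日本語</b>".encode("euc_jp")
for kind, chunk in split_euc(mixed):
    print(kind, chunk)
```

Because the markup lives entirely in the 'ascii' runs, a tag scanner never has to look inside the Japanese text.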

If all three encodings were disjoint (that is, all occupied independent regions on the map), you'd be able to tell what the encoding scheme is immediately upon receipt of a single character. Shift-JIS, however, goes and ruins things by intersecting with the EUC encoding when the first byte is from $e0 to $ee and the second byte from $a1 to $fc (the purple region labelled with a question mark). If it so happens that all the characters you receive are within this region, you have no way of knowing what the encoding scheme is.
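A rough sketch of this kind of detection in Python follows. The overlap region uses the byte ranges given above; the remaining ranges ($81 to $9f as Shift-JIS only, other high first bytes as EUC) are my assumptions, and the sketch ignores half-width katakana entirely. The function name guess_encoding is hypothetical.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'iso-2022', 'shift-jis', 'euc', 'ambiguous' or 'unknown'."""
    if b"\x1b" in data:
        return "iso-2022"            # escape sequences give ISO-2022 away
    verdict = "ascii"
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:             # plain ASCII byte
            i += 1
            continue
        if i + 1 >= len(data):
            return "unknown"         # truncated two-byte character
        second = data[i + 1]
        if 0x81 <= first <= 0x9F:
            return "shift-jis"       # assumed: EUC never starts a character here
        if 0xE0 <= first <= 0xEE:
            if 0xA1 <= second <= 0xFC:
                verdict = "ambiguous"    # the overlap region; keep looking
            elif second < 0xA1:
                return "shift-jis"   # second byte too low for EUC
            else:
                return "euc"         # $fd or $fe, too high for Shift-JIS
        elif 0xA1 <= first <= 0xFE:
            return "euc"             # assumed: no half-width katakana
        else:
            return "unknown"
        i += 2                       # both schemes use two bytes in the overlap
    return verdict

print(guess_encoding("日本語".encode("shift_jis")))
print(guess_encoding("日本語".encode("euc_jp")))
print(guess_encoding(b"\xe0\xa1"))
```

The function only answers "ambiguous" when every non-ASCII character in the input falls inside the overlap, which is exactly the case described above.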

A very short while ago i said this:

Assuming you know that your document is going to be Japanese, EUC-JP makes the most sense on the WWW, because it is instantly distinguishable from ASCII (unlike the evil Shift-JIS) and doesn't conflict with special characters used by HTML (like ISO-2022-JP).

But now, as i try to implement multilingual capabilities for Shodouka (Chinese, Japanese, and Korean), my opinions are changing. When text is encoded with EUC, it does indeed stand out from ASCII -- but it then becomes indistinguishable from any other locale-specific EUC encoding, such as EUC-encoded Chinese or Korean, for instance. This can make life quite difficult.

The worldwide standard for multilingual document encoding is ISO-2022. By using other escape sequences besides the ones mentioned in the page on Japanese encoding, you can indicate the presence of other languages. So ISO-2022 really has become my first-ranked choice.
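As an illustration of how escape sequences announce character sets, here is a small Python sketch that lists the designations found in a byte stream. The table is taken from ISO-2022-JP-2 (RFC 1554), not from this page, and covers only a handful of sequences; charsets_used is a hypothetical helper name.

```python
# Escape sequence -> character set it switches to (subset of RFC 1554).
DESIGNATIONS = {
    b"\x1b$(C": "KS C 5601 (Korean)",
    b"\x1b$(D": "JIS X 0212",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
    b"\x1b$A": "GB 2312 (Chinese)",
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
}

def charsets_used(data: bytes) -> list[str]:
    """Return the character sets designated in data, in order of appearance."""
    found = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            for seq, name in DESIGNATIONS.items():
                if data.startswith(seq, i):
                    found.append(name)
                    i += len(seq)
                    break
            else:
                i += 1   # an escape sequence this sketch does not know
        else:
            i += 1
    return found

print(charsets_used(b"\x1b$B<>\x1b(B"))
```

Unlike EUC, the byte stream itself says which language each stretch of text belongs to.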

The reason i previously considered it inferior is that it may contain the special characters which indicate HTML markup. But a possible way around this is to set the high bit on each byte (like EUC) and also surround Japanese with escape sequences (like ISO-2022-JP). This would seem to be the best of all worlds, as it should be readable by programs that decode EUC and ISO-2022-JP alike, while also avoiding the HTML markup problem. Imagine that!
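The idea can be sketched in Python as follows. This is only one reading of the scheme described above, assuming the input is ISO-2022-JP that uses the usual JIS X 0208 and ASCII/JIS-Roman escape sequences; set_high_bits is a made-up name.

```python
TWO_BYTE = (b"\x1b$@", b"\x1b$B")   # escape sequences that switch to JIS X 0208
ONE_BYTE = (b"\x1b(B", b"\x1b(J")   # escape sequences that switch back to one-byte text

def set_high_bits(iso: bytes) -> bytes:
    """Keep the escape sequences, but set the high bit on every JIS byte."""
    out = bytearray()
    in_jis = False
    i = 0
    while i < len(iso):
        seq = iso[i:i + 3]
        if seq in TWO_BYTE or seq in ONE_BYTE:
            in_jis = seq in TWO_BYTE
            out += seq               # escape sequences pass through untouched
            i += 3
        else:
            out.append(iso[i] | 0x80 if in_jis else iso[i])
            i += 1
    return bytes(out)

hybrid = set_high_bits("<p>日本語</p>".encode("iso2022_jp"))
print(hybrid)
```

With the escape sequences removed, the result is byte-for-byte the EUC form of the same text, and the only "<", ">" and "&" bytes left are the real markup.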

So for the moment i believe escape sequences together with 8-bit shifted text are the way to go.

Support

Because of the ambiguity between Shift-JIS and EUC, software usually supports either ISO-2022-JP and EUC or Shift-JIS, though some software supports all three encodings.

A summary of ways to display Japanese on the WWW is available on this site.