Text Encoding

This little set of documents is an attempt to explain what i've learned about text encoding over the past little while, in the hopes that those who are curious may also understand. Please keep in mind that i claim to be no authority on the subject, but merely present this information in my own words, to the best of my knowledge. If you find errors within, please be sure to tell me about them. If this makes things clearer for you, i'd be happy to hear from you as well.

Some of the numbers i mention here will be in hexadecimal, because they tend to round out nicer that way. The hexadecimal digits go from 0 through 9, then A (for 10) through F (for 15), and each digit position is 16 (rather than 10) times more significant than the position on its right. To tell you that a number is in hexadecimal, i'll precede each one with a dollar sign.
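As a small illustration of the dollar-sign notation, here is a minimal Python sketch. The function names `parse_dollar_hex` and `positional_value` are made up for this example; the second one works the value out by hand, to show that each position is worth 16 times the one to its right.

```python
HEX_DIGITS = "0123456789abcdef"

def parse_dollar_hex(text: str) -> int:
    """Convert a string such as "$7f" to an integer using Python's built-in int()."""
    if not text.startswith("$"):
        raise ValueError("expected a leading dollar sign")
    return int(text[1:], 16)

def positional_value(digits: str) -> int:
    """Evaluate hex digits by hand, left to right."""
    value = 0
    for digit in digits:
        # Shifting the running total one position left multiplies it by 16.
        value = value * 16 + HEX_DIGITS.index(digit.lower())
    return value

print(parse_dollar_hex("$7f"))   # 127
print(positional_value("7f"))    # 7 * 16 + 15 = 127
```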

Character Sets

A character set is a table which places a finite set of symbols in a one-to-one correspondence with a set of distinct integers. Our familiar companion ASCII, for instance, defines a meaning for each of the integers from 0 to 127; it includes, among other mappings, the association of the value 65 with the capital letter "A".
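One way to picture such a table is as a dictionary from integers to symbols. The sketch below uses a tiny hand-picked excerpt of ASCII (the name `ascii_excerpt` is just for illustration) and shows that, because the correspondence is one-to-one, the table can be turned around to look up the integer for a symbol.

```python
# A tiny excerpt of the ASCII table: integer -> symbol.
ascii_excerpt = {65: "A", 66: "B", 67: "C", 97: "a"}

# Because the mapping is one-to-one, inverting it loses nothing.
symbol_to_code = {symbol: code for code, symbol in ascii_excerpt.items()}
assert len(symbol_to_code) == len(ascii_excerpt)

print(ascii_excerpt[65])       # A
print(symbol_to_code["A"])     # 65

# Python's built-in chr() and ord() agree with ASCII for values 0 through 127.
print(chr(65), ord("A"))       # A 65
```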

But the set of integers does not have to start at 0, end at any particular maximum value, or even be contiguous. For example, the part of HTML's character set that falls within ASCII includes only tab (9), linefeed (10), carriage-return (13), space (32), and the usual printing characters from 33 to 126; the remaining control characters, including delete (127), are left out.
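To make the "not contiguous" point concrete, here is a rough sketch that builds such a set of integers and checks it for gaps. It assumes the printing characters run from 33 through 126, since 127 is the delete control character; the names `html_ascii_codes` and `is_contiguous` are invented for this example.

```python
# Tab, linefeed, carriage-return, then space (32) and printing characters 33..126.
html_ascii_codes = {9, 10, 13} | set(range(32, 127))

def is_contiguous(codes: set) -> bool:
    """True when the codes form one unbroken run of integers."""
    return max(codes) - min(codes) + 1 == len(codes)

full_ascii = set(range(128))

print(is_contiguous(full_ascii))         # True
print(is_contiguous(html_ascii_codes))   # False

# The gaps between the smallest and largest code in the subset.
gaps = sorted(set(range(min(html_ascii_codes), max(html_ascii_codes) + 1)) - html_ascii_codes)
print(gaps[:3], len(gaps))               # [11, 12, 14] 20
```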

Character Transfer Encodings

A character encoding is a scheme for representing the numeric values in a character set for a particular mode of transmission. Be careful not to get this confused with a character set, for it is quite a different animal.

In almost all situations, the goal of a character encoding is to map the numeric values of the characters onto a stream of 7-bit or 8-bit bytes, since that's how most transmission takes place. In the case of ASCII, we never have to worry about character encoding, because the range of those values is exactly the range of a 7-bit byte (from 0 to $7f), so each character fits in one byte as is. With some character sets, however, more than 128 different characters are required. That's where all the fun comes in.
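The ASCII case can be sketched as the simplest possible encoding: one character value per byte, with no transformation at all. The function `encode_7bit` below is a hypothetical helper written for this example, and the value 300 stands in for a character from some larger character set; it is not taken from any particular standard.

```python
def encode_7bit(values):
    """Place each character value in its own byte, refusing anything outside 0..$7f."""
    for value in values:
        if not 0 <= value <= 0x7F:
            raise ValueError(f"value {value} does not fit in a 7-bit byte")
    return bytes(values)

# ASCII values go straight through.
ascii_values = [ord(ch) for ch in "Ping"]
print(ascii_values)                 # [80, 105, 110, 103]
print(encode_7bit(ascii_values))    # b'Ping'

# A value beyond $7f needs a real encoding scheme; this trivial one gives up.
try:
    encode_7bit([80, 300])
except ValueError as error:
    print("cannot encode:", error)
```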

Character sets and encoding schemes


copyright © by Ping (e-mail) updated Fri 3 May 1996 at 18:52 JST
since Fri 3 May 1996