Character Encoding

From: http://www.yale.edu/pclt/encoding/index.htm

Version: Nov. 20, 2002
Copyright 2002, Howard Gilbert

The first step in supporting International character sets is to get a current Browser. Internet Explorer, Netscape 7, Mozilla, or Opera will do. An old Netscape 4 browser won't work. You may need to install optional character sets. If the following three boxes don't display in Thai, Russian, and Hebrew, you don't have International support.

"เอกชัย ศรีวิชัย"สุดทน เจอเทปผี ซี.ดี.เถื่อนแย่งตลาด สั่งลูกน้องไล่กระทืบพ่อค้าขายซี.ดี.เถื่อน พร้อมประกาศลั่น ฝากบอกเอเย่นต์ใหญ่ว่า "เอกชัย สั่งให้ทำ" ยันไม่สนใจใครอยู่เบื้องหลัง มี И все это произошло из-за подарка, который Франция решила преподнести Петербургу к юбилею. Авторство идеи принадлежит Кларе Хальтер.
סאמחו חתפל תפתושמ הדעו
לארשיב םיעוגיפה תייגוסב ןודת
תקספהל בייחתהל סאמח הבריס חתפ םע תוחישב
תונדפקב ןיינעה ןוחבל המיכסה ךא ,םיעוגיפה
 
ןב ףולאו ןמלבוס לאינד ,ןייטשניבור ינד

A Web Browser can display current news in every modern language. The user only has to know how to point and click to get documents from around the world. The Browser takes care of the details.

Producing a Web page in any single language is no problem. For decades the computer vendors have been customizing their hardware, systems, and applications for various national languages. Programmers and authors simply use the tools they have been using all along.

A new problem occurs, however, when  an editor, application, library, or portal seeks to combine information from sources in many different countries and present it consistently. The national solutions adopted in different countries are not directly compatible. The information has to be reprocessed into a common format, and that processing has to conform to current standards.

This paper will emphasize the Web and its standards (primarily HTML and XML), although the same information applies to programming and other applications.

Terms

Experts have developed a certain precise vocabulary to explain the issues of international text processing. Unfortunately, the developers of Web standards have not used these terms with precision or consistency. It is too late to go back and fix the errors, but the first step to understanding the mess is to define the terms.

Character

The term "character" refers to an abstract concept (and not a numeric computer code, a physical mark on paper, or a bit pattern displayed on the screen). A character is more about meaning than shape. For example, the capital letter G can be printed or written in cursive script. It may appear illuminated in a Medieval manuscript. Its still the same letter and has the same meaning in words.

The various forms in which the character can be represented physically are given the technical name glyphs.

Uppercase "G" and lowercase "g" may be the same letter, but they are different characters because they have different semantics. There is something about proper names, something about the start of sentences, and even some case sensitivity in Unix and C. The difference between the printed and cursive capital letters, however, is a matter of display.

Looks can be deceiving. Consider the characters "A", "Α", and "А". The first is our capital letter A. The second is Greek and the third is Cyrillic. They are three distinct characters, although they are displayed identically. Σ is a character in the Greek alphabet, while ∑ is the mathematical sigma used to sum a set of terms in a mathematical equation.

Sometimes different characters can mean exactly the same thing, but are displayed differently. Consider the simple quotation mark ("). Word processors often replace it based on context with the separate left and right quotation mark characters (“ and ”). In some countries of Europe, however, quotations are delimited by a different form of quotation mark (« and »).

Characters can also be formed to represent typographical versions of combinations of characters. For example, ½ is a character called "the vulgar fraction one half". It is a single character, as distinct from the three character sequence 1/2 that looks enough like it to pass for all practical purposes. The German character ß is a substitute for the two letter sequence "ss", and œ is used in some languages as a substitute for the two characters "oe".

However, we have been forced by first typewriters and then computers to make do with a limited number of characters. In some cases a single character has been used for two completely different meanings. If it had been less important, these two meanings might be expressed today by two different characters.

For example, the dash "-" is sometimes used as a hyphen between words and syllables, though it is also used as the minus operator in mathematical expressions. Modern expanded character sets have several characters that are alternate forms of the hyphen, and other characters that are alternate forms of minus. However, the plain ASCII character "-" on every keyboard is ambiguous. Its meaning is determined by the context in which it is used.

Character Set

A character set is a collection of characters that can be entered, stored, or displayed by a program. Computers don't add support one character at a time. Instead, support for an entire set is installed in one operation.

The character set that everyone uses to design Web pages or program computers is most commonly called "ASCII". This stands for the "American Standard Code for Information Interchange", and the "American" part shows that ASCII is very much a US standard. It contains 95 "graphic" characters (with the qualification that "blank" is regarded as a graphic character because it takes up space).

Thirty years ago, when computer equipment and communications were less powerful, some foreign language support was achieved by replacing some characters in ASCII designated as "national use characters" with other characters needed in other languages. The most obvious target was "$", which could plausibly be replaced by the symbol for a different currency. If dollars remained important, then "#" was a popular second choice for the local currency. However, for at least the last 15 years all equipment and networks have been able to support larger sets with more characters. Today the full US ASCII character set is the universal starting point, and foreign character sets are created by expanding ASCII with additional characters rather than by substitution.

When there were tight limits on the number of characters, computer vendors created character sets that were not just targeted to specific languages, but also to specific countries. For example, an IBM character set for France was slightly different than the set for Canadian French. Now that computer networks connect every home user to every country in the world, such intense specialization is inefficient.

The same element typically appears in more than one language. For example, the cedilla mark (ç) is most commonly recognized as a feature of the French language, but it is also used in Albanian, Catalan, Portuguese, and Turkish. It is more efficient to develop character sets that cover a broad range of related languages that share common characters.

There are standard character sets that provide all the characters needed for particular languages or regions. The most popular extended character set is called "Latin 1" and contains characters needed for Western languages:

Latin1 covers most West European languages, such as French (fr), Spanish (es), Catalan (ca), Basque (eu), Portuguese (pt), Italian (it), Albanian (sq), Rhaeto-Romanic (rm), Dutch (nl), German (de), Danish (da), Swedish (sv), Norwegian (no), Finnish (fi), Faroese (fo), Icelandic (is), Irish (ga), Scottish (gd), and English (en), incidentally also Afrikaans (af) and Swahili (sw), thus in effect also the entire American continent, Australia and much of Africa. The most notable exceptions are Zulu (zu) and other Bantu languages using Latin Extended-B letters, and of course Arabic in North Africa, and Guarani (gn) missing GEIUY with ~ tilde. The lack of the ligatures Dutch IJ, French OE and ,,German`` quotation marks is considered tolerable. The lack of the new C=-resembling Euro currency symbol U+20AC has opened the discussion of a new Latin0. [From "ISO 8859 Alphabet Soup"]

The Latin 1 character set is small enough that each character can be assigned a code and still stay within the limitation of one byte of storage per character. However, every decision to subset involves some controversy. There is not quite enough room in the set to support the major Western languages (French, Spanish, German, etc.) and still squeeze in both Icelandic and Turkish. Any rational empirical decision would note that there are 278 thousand people in Iceland compared to 66.6 million people in Turkey. In one of the most disgraceful examples of geographical bias, the Latin 1 set, which became the default for most computer applications, decided to exclude Turkey in favor of Iceland. Of course there is another Latin set that includes Turkey and excludes Iceland, but it is not a widely used default.

Character sets provide a workable subset of characters for a particular set of countries. Some sets include one additional alphabet (Greek, Cyrillic, Arabic, Hebrew). Others include accented characters for a particular region.

Character Code

Each character in a character set has to be associated with a number that can be stored in computer memory. There are a number of standards that both define the characters in the set and assign each character a number. However, strictly speaking, the selection of the characters in the set is one step; the assignment of codes is a separate step.

This was more important 15 years ago when mainframes were a larger influence in computing. Due to historical accident, the IBM mainframes had developed different code assignments for characters than the standard used on personal computers and the Internet. IBM supported the Latin 1 character set, but inside the mainframe each character was stored with a different code value. The internal mainframe codes could be easily translated to the external standard when data was transmitted over the Internet, and Internet data could be translated back when it was stored on the mainframe.

The last time that anyone screwed something important up in this area was during the design of the IBM PC. The Latin 1 character set had been clearly established, so the engineers knew what characters had to be displayed. The international standard code values were also available, but the engineers either did not know or ignored them. A story is told that with a deadline looming, the characters were more or less randomly assigned to code values during the airplane trip to the meeting when a final character table had to be presented.

This problem disappeared when DOS was replaced by Windows. In Windows, everything displayed on the screen is generated by software, so any or all standard character code systems can be supported natively. Portable software technology, like Netscape and Java, is also based on international standards. So modern software now supports the code values assigned by formal standards.

Encoding

Character data is stored on disk or transmitted over the network as a stream of bytes. When the characters are all ASCII, then the byte values stored or transmitted correspond exactly to the character code values. The letter "A" will be represented as a byte with a numeric value of 65, the code value assigned to that letter. This approach works as long as the character set is small enough that the highest code value is less than 256.

However, the World Wide Web, in order to actually be "world wide", requires a standard character set that includes all the world's languages. Therefore, the HTML 4.0 standard is defined over the "Unicode" character set. Unicode supports every alphabet, every accented character, and even the ideographs of Chinese and Japanese. To do this, however, each character is assigned a two byte code value.

Now it would be possible to store all files on disk two bytes per character, but only if you were willing to give up half your disk space. In the United States this would be particularly unpopular because few Americans know any foreign languages or need the extra characters. Besides, we expect everyone else in the world to speak our language.

Curiously enough, this is not quite as unpopular in the rest of the world as you might think. ASCII really is the most important character set, because it is used to define the syntax of HTML and programming languages.

As an experiment, you might want to visit a Web page that displays news articles in Hebrew (HA'ARETZ) or Chinese (Hua Sheng Online). Now click on the browser option to View Source. While the page is displayed in foreign characters, the first few pages of HTML source are mostly "English" JavaScript and HTML tags.

It would be inefficient to expand the ASCII characters everywhere they appear to a two byte sequence. Although it might be "fair" to other languages, in practice it wastes disk space and network bandwidth for everyone. Therefore, most character storage and transmission schemes store the ASCII characters in a one byte format, even if this means that other characters have to become larger to make room.

The same thing applies to modern computer programs. The Java language was defined so that character variables and text strings are internally Unicode and store each character in a two byte storage unit. However, external files are recognized to be a sequence of bytes called an InputStream or OutputStream. If the file contains characters, then the stream of bytes must be converted to a stream of characters (an InputStreamReader or OutputStreamWriter). Although it is common to convert one byte to one character on input and one character to one byte on output, this is not the only way to encode characters.
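
As a concrete illustration of this byte-to-character conversion, here is a minimal Java sketch. The file name and the choice of ISO 8859-1 are illustrative assumptions, not part of the original discussion.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    public class ReadWithCharset {
        public static void main(String[] args) throws IOException {
            // The byte stream knows nothing about characters; the Reader applies an encoding.
            // The file name and the ISO-8859-1 choice are placeholders for illustration.
            try (Reader in = new InputStreamReader(
                    new FileInputStream("page.html"), StandardCharsets.ISO_8859_1)) {
                int c;
                while ((c = in.read()) != -1) {
                    // Each value returned is a Unicode character, regardless of how
                    // many bytes represented it in the file.
                    System.out.print((char) c);
                }
            }
        }
    }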

HTML and XML define "numeric entities" that allow a programmer to represent any character in the Unicode set without leaving ASCII. For example, to put ΦΒΚ (Phi Beta Kappa) into a Web page, the three Greek letters can be expressed in a pure ASCII HTML file as "&#" followed by the decimal numeric code value assigned to the letter in Unicode (or "&#x" followed by the hexadecimal code), with an ending semicolon. Thus Phi Beta Kappa is represented by "&#934;&#914;&#922;". This example required six ASCII characters to represent each special character. This is reasonable when most of the text is English and foreign characters are unusual.
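
A short sketch of how a program might emit such references, assuming Java; the method name toNumericEntities is made up for illustration, and real code would also escape markup characters such as "&" and "<".

    // Replace every non-ASCII character with a decimal numeric character reference.
    // ASCII characters pass through unchanged; markup escaping is omitted for brevity.
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // handles characters beyond U+FFFF
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // toNumericEntities("\u03A6\u0392\u039A") returns "&#934;&#914;&#922;"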

Character Codes Predate Computers

Before computers were invented, corporate data processing was done using decks of punched cards. A deck of sales data could be run through a mechanical sorter to arrange the cards by region, customer, salesman, or product. Then the sorted deck was fed into a mechanical tabulator "programmed" with patch panel wires to sum up groups of columns containing quantity or dollar value to generate reports. IBM dominated this early information processing technology.

IBM introduced computers into this existing product line to act as a more sophisticated card tabulating device. As a result, IBM created a computer character code that matched the way that characters were punched on the cards. The punch card was not really a binary system. It had ten rows labeled "0" to "9" and three extra control rows. Mechanical adding machines operated on decimal numbers, not binary numbers. For each column on the card there was a wheel with positions numbered 0 to 9. To add a number punched on the card, the device would sense which of the ten holes was punched and then rotate the wheel the corresponding number of positions. A "0" punch left the wheel alone. A punch in the "5" row rotated the wheel 5 positions. When the wheel rotated from 9 back to 0, a 1 carried over to the wheel one column to the left.

When letters were punched on the card instead of numbers, one of the top control rows was also punched. Nevertheless, IBM arranged the letters in a decimal system. "A" was represented by a punch in the 1 row plus a punch in a control row. "B" had a 2 punch, and so on up to "I" with a 9 punch. Then the letters started over with a different control punch, so "J" had a 1 punch under an alternate control row, continuing up to "R" with a 9 punch.

When IBM created a character code for computers, it based the code on the decimal system used in the punch cards rather than binary values that would have been more natural for computers. The IBM code (called "EBCDIC") preserved the break between "I" and "J" and between "R" and "S".

Punch cards were a fairly expensive method of entering data. If the equipment had not already existed, it would not have been created just for computers. Other early computer makers wanted a less expensive system, but they were not prepared to design their own equipment. So they turned to a device created by another US monopoly: the phone company. AT&T had been producing Teletype machines for years to support transmission of text messages over the telephone network.

When telegraph signals were sent by hand, characters were transmitted as a series of dots and dashes. Teletype machines, however, transmitted text electromechanically by switching a continuous sound sent over a phone line between two different tones. One tone, called "Mark", represented a 1, while the other, called "Space", represented a 0. A Teletype transmitted 10 characters per second. An operator could dial the phone, connect to another Teletype, and type a message at the keyboard. Each key pressed on one machine was printed onto the roll of paper at the other end. However, long distance phone calls were expensive in those days. It was more efficient to prepare the message in advance by recording it on punched paper tape. Then when the phone connection was made, a paper tape reader connected to the Teletype device could transmit the message at the maximum speed of 10 characters per second.

By good luck, the Teletype code was binary and was therefore ideal for computer applications. The Teletype only supported uppercase letters. With 10 numeric digits, 26 uppercase letters, and a space, there was still room for 27 punctuation marks within the 64 code values that a six bit code would permit (the all-ones value of the full seven bit code was reserved for correcting typographical errors, as described below). However, the Teletype also required "control characters" that correspond to the typewriter operations of Return, Tab, Backspace, and Page Eject.

Teletypes had a few extra control functions like BEL ("Bell"). The printer had a little bell in it that rang when the BEL character was received. Teletypes connected services like AP to the newspapers and TV/radio stations. When an important story was about to be printed out, a few BEL characters were supposed to attract attention. The "Hot Line" between the President and the Kremlin was never red phones, but was instead a pair of Teletype machines.

So although the Teletype had only 64 graphic characters, it added an additional 32 control characters. That forced the code up to seven bits. A seven bit code has possible values from 0 to 127. The highest value is reserved for error correction. This meant the original ASCII standard had 32 control codes (0-31), 64 graphic characters (32-95), 31 unassigned code values (96-126), and DEL (127).

Standards are reviewed every five years. At ASCII's first revision, the 31 previously unused code values were assigned to support lower case letters and some additional punctuation. This provided a basic Latin alphabet that would support English language text.

Eight Bit Codes

Since modern computers store data in eight bit bytes, the obvious next step was to expand the seven bit ASCII code by one extra bit. Each bit doubles the number of possible code values, so an eight bit code might provide an additional 128 characters.

However, to create an eight bit character set within the International Standards Organization framework, rules established before there was an Internet or personal computer require that the first 32 new codes (values 128 to 159) be reserved for additional control characters. That reduces the number of new characters to 96.

Today the extra control characters are no longer needed because all formatting functions are performed inside the desktop PC. However, the first eight bit ASCII devices were terminals connected over phone lines to central computers. The extra control codes were used to more efficiently change the color of the foreground characters, change the background color of a line on the screen, display a phrase in italics, or jump around positioning text on different parts of the screen.

Western countries use the 26 letters of the Latin alphabet inherited from Rome. Other alphabets include Greek, Cyrillic, Arabic, Hebrew, and Thai. Some languages use the Latin alphabet but add additional accent marks (also called "diacritical" marks) to certain letters.

Within the limit of 96 new characters, it is possible to add a second alphabet, or a reasonable number of accented Latin letters. Rather than creating separate character sets for individual languages, each standard groups several languages geographically. These standards belong to the ISO (International Standards Organization) 8859 family. Each standard in the family defines both a character set (such as "Latin 1") and an assignment of one byte codes to each character. Remember, the first half of each of these standards is identical to ASCII:

  • Latin-1 (ISO 8859-1) contains all the characters needed for English, French, Spanish, Italian, German, Swedish, Icelandic, and basically all the other languages used in Western Europe.
  • Latin-2 (ISO 8859-2) contains the characters needed for Polish, Czech, Hungarian, Romanian, Croatian, Slovak, Slovenian, and other languages of Eastern Europe except for the Baltic states.
  • Cyrillic (ISO 8859-5) Russian, Bulgarian, Macedonian, and other Russian influenced languages.
  • Arabic (ISO 8859-6) North Africa and the Middle East.
  • Greek (ISO 8859-7)
  • Hebrew (ISO 8859-8)

Arabic requires a special note. There is no "printed" form of Arabic. It is, instead, a "cursive" script like handwriting. Arabic is written right to left, and each character has four different forms based on its position in a word:

  • when the character appears by itself
  • at the start of a word (connection to the next character on the left)
  • in the middle of a word (connection from the right, continuing to the left)
  • at the end of a word (connection from the right, ending decoration to the left)

The eight bit code of ISO 8859-6 assigns one code to each character. This is preferred for data stored on disk or in a database. However, for text in this code to be printed correctly, or even written correctly on the screen, the program must provide logic to determine the context of the character and select the correct display form. There are expanded character sets for Arabic that assign different code points to each presentation form of a character. They may have been important when a stream of bytes was transmitted to a dumb device that had to display the information. However, in the modern era of microprocessors, every device should be able to select the proper presentation form from the single eight bit code.

ISO standards created today are constrained by the rules of international standards to be compatible with original standards introduced in the early 1960s, and those standards evolved from communications standards that predate computers. The original reason for some components of a standard may have long since disappeared. For example, the character whose code value is 127 is reserved as a control character named DEL. Back in the days of Teletype machines, information was punched on paper tape. If an operator hit the wrong key and punched the wrong character, there was no way to un-punch the holes in the tape. However, one could back the tape up one position, hit the DEL key, and punch out all 7 holes. DEL or 127 corresponds to the binary value of 1111111 (seven ones). So this code value was reserved as an additional control character. When information was transmitted or copied from one paper tape to another, the DEL characters on the input tape were skipped.

DEL and the other 64 control characters take up a lot of room in the limited range of values from 0 to 255 that can be stored in a single byte. In modern computers, where data is entered and corrected with screen editors, where information is transmitted over the Internet, where rich formatting is driven by HTML tags, and where printed pages are formatted by Windows or Macintosh desktop publishing, most of the control characters serve no useful purpose. The ISO standards cannot, however, reclaim the reserved code values.

This becomes a problem for languages that can fit into a one byte data space of 256 characters, but not into the ISO 8859 limit of 192 characters. Vietnamese, for example, has too many different versions of accented characters and so is frequently stored in a non-standard eight bit code.

The video adapter design for the first IBM PC included strange graphic characters (smiley face, heart, spade, club, diamond, etc.) to use every one of the 256 possible byte values. Subsequent adapters could also be programmed with alternate arrangements of graphic characters. IBM referred to these as "Code Pages", perhaps to not confuse them with standard character sets like 8859-1.

The current design of Windows preserves the idea of a Code Page, although Microsoft implements them as a proper superset of ISO standards. The ability to assign graphic characters as a substitute for the no longer meaningful control characters gives Microsoft the opportunity to support more languages, more quickly, and more completely than the slow moving standards process. Today, Web content from Latvia, Lithuania, and Estonia is as likely to be in the "WinBaltic" code page as in any ISO standard.

Fifteen years ago the Latin 1 alphabet and the ISO 8859-1 standard covered most of the computers in use outside Japan. Eastern Europe was behind the Iron Curtain, and computer networks were mostly national in scope. So 8859-1 became the default character set in a wide range of applications and systems, from the PostScript printer system to the HTTP network protocol. In the last five years it has become clear that no eight bit code is broad enough to remain as a default.

Bidirectional Character Sets

Hebrew and Arabic characters are read and written starting at the right margin and moving left. However, the ISO 8859-6 and -8 standards also include the Roman alphabet to represent HTML tags and programming language source. Computer source typically has a mixture of both types of characters, and tools that support such a mixture are said to be "bidirectional".

Programs don't have a problem. If, instead of reading text from the screen, you open the file and read the data from disk or from the network one character at a time, everything arrives in exactly the correct order. As data, the first character arrives first, and the first word has arrived when a blank is encountered. Latin characters arrive in the order they would be read. Hebrew and Arabic characters arrive in the order they would be read. The period at the end of the sentence arrives at the end of the sentence. That may not seem remarkable, but now consider how this perfectly reasonable binary stream of data gets messed up when it has to be presented to the human eye.

Consider the following sentence borrowed from the description of Unicode at the http://www.unicode.org/ site. It appears first in English and then in Hebrew. Hebrew sections are color matched to the corresponding English translation:

Unicode is required by modern standards such as XML, Java, ECMAScript(JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646

יוניקוד נדרש על-ידי תקנים מודרניים כמו XML, Java, ECMAScript(JavaScript), LDAP, CORBA 3.0, WML וכדומה, ומהווה למעשה את היישום הרשמי של תקן ISO/IEC 10646.

The second paragraph is justified to the right margin, and the contents of the tag are displayed starting at the right margin and moving to the left. To achieve this display, the paragraph tag contains the dir="rtl" attribute (for "direction: right to left").

To read an RTL block of mixed text, read each line right to left, top to bottom. Start at the right margin and look for the first (rightmost) block of contiguous Hebrew or Latin characters. If the block is Hebrew, read the characters right to left. If the block is Latin, read the characters left to right. When all the characters in the block are read, if there are more characters to the left of the ones you just read, find the next block. If not, start at the right margin of the next line and find the next block.

To see what this really means, make the Browser Window wider or narrower and see how the characters flow. There is some strange behavior. For example, if the leftmost (last) characters on one line are Latin characters and the rightmost (first) characters on the next line are also Latin, then as you squeeze the window to make it narrower text "flows" from the middle of the top line to the middle of the next line.

Numbers, however, are a problem. The characters we use for 0 to 9 are called "Arabic numerals", but they are still read left to right and are the same in any language. Algorithmically, numbers are not regarded as part of the Latin or any other alphabet. This can cause a problem with some Browsers when the "3.0" after "CORBA" or the "10646" wraps to the next line. Note also that the period that ends the sentence comes at the end (leftmost) even though the text that immediately precedes it is Latin and therefore its characters were scanned left to right.

All of this is presentation. The algorithm to display the characters on the screen is complex. The reader has to follow a complex eyeball algorithm to scan the characters. On disk, in memory, or on the network, the bytes and characters all arrive in exact lexical order. So programs don't need to know anything about this complex presentational mess (except to add dir="rtl" to the tag).

Local Double Byte Codes

The Chinese, Japanese, and Korean languages have thousands of ideograph characters. This produced problems long before computers. When the printing press was introduced, each country had to develop simplified character sets that could plausibly be represented in type. However, even simplified sets may not fit in any eight bit code.

As computers and networks were introduced, each Far Eastern country developed several computer character codes. Generally these codes start with the ASCII character set and then add the local characters as multi-byte sequences.

Character codes are the simplest problem in supporting Far East languages. Keyboard input is a bigger issue. Screen display is an issue. Sorting and indexing is a problem. Over the years each country developed a body of hardware, programs, databases, and utilities to handle the local national standard character sets. This large body of existing material will not be replaced by efforts to develop some additional international character set.

A Japanese page may be in the "Japanese EUC" or "Shift-JIS" code, plus a few specialized alphabets. Chinese has two "simplified" codes and a larger "traditional" set. Any real data from a Far Eastern country will come in one of these national standard codes.

Unicode and ISO 10646

During the 1980s a group of computer companies were working on a single universal character code that could combine all the characters of all the languages in the world. They figured out a trick to keep the code down to a two byte value.

If you combine one existing character set for each of the three Far East languages, you end up with a set much larger than the 65536 possible values in a two byte code. However, a very large number of the ideographs in the three languages can be made to share the same code values if you adopt a certain historical perspective.

All Far East languages descend from an original written script developed during the Han dynasty in China. Down through the centuries the form and some of the meanings of the ideographs drifted apart in the three languages, and today the relationship cannot be easily identified. The idea was to assign a single code value to three ideographs in the three languages that appeared to descend from a common ancestor. This still left a few thousand more "modern" characters to fill in, but it kept the total number of code points within the two byte limit.

Requests subsequently appeared for a number of less important characters that were missed in the first pass. This pushed the character set beyond the two byte boundary. However, almost any important modern text can be expressed by staying inside the two byte limit.

Around the same time, a committee of ISO was also looking for a single unified code. They were aware of the Unicode idea, but the ISO membership is countries instead of companies. Initially each Far East country wanted to directly embed one of their traditional multi-byte codes in the final standard. The initial recommendation was monstrous.

The complete ISO group rejected this proposal and sent instructions back to the committee to look more carefully at Unicode. The combined efforts of the two groups improved the standard so today "Unicode" and "ISO 10646" are two different names for the same thing.

Complete information on Unicode can be found on the http://www.unicode.org/ site.

The simplest way to process a stream of Unicode characters is to store each character in a two byte field. If the data is predominantly ASCII, either because it contains mostly English language text or because of the HTML tags and JavaScript logic, then the data can be represented more compactly in a format called "UTF-8". In this form, the ASCII characters 0-127 are represented as single bytes. Any foreign character, even the other Latin 1 characters in the range 128-255, is converted to a multi-byte sequence, and some characters will be expanded to three bytes.
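
A short sketch, assuming a Java environment, that demonstrates the expansion just described; the sample characters are chosen only for illustration.

    import java.nio.charset.StandardCharsets;

    public class Utf8Sizes {
        public static void main(String[] args) {
            // ASCII stays one byte, a Latin 1 accented letter becomes two bytes,
            // and a CJK ideograph takes three bytes in UTF-8.
            String[] samples = { "A", "\u00E9" /* e with acute accent */, "\u4E2D" /* a CJK ideograph */ };
            for (String s : samples) {
                int bytes = s.getBytes(StandardCharsets.UTF_8).length;
                System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                        + " -> " + bytes + " byte(s)");
            }
        }
    }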

Unless a text is predominantly composed of Far East languages, UTF-8 is usually the most compact external form for Unicode. However, if a file consists exclusively of characters from one particular national language, then it is probably more efficient to use a local eight bit or legacy Far East code.

Because Unicode assigns a single code value for three Chinese, Japanese, and Korean ideographs that have quite different forms, a Web page that contains Far Eastern Unicode data must specify which of the three languages is associated with any block of text. HTML 4 adds the "lang" attribute to all HTML tags for this purpose.

If a paragraph begins with <p lang="ja"> then the Browser knows to display its Unicode contents in Japanese ideographs. Had the paragraph begun <p lang="zh"> then the Browser would have displayed the same code values as Chinese ideographs.

XML has a similar attribute. However, since XML syntax is defined by the user or application and some preexisting object might already use an attribute called "lang" for some other purpose, the XML language selection attribute is qualified by a namespace as "xml:lang". An object might include a tag:

	<description xml:lang="zh"> ... </description>

Unicode characters that are part of the content of this tag and have code points in the ideographic character range would then be identified as Chinese. XML can only apply attributes to a tag's entire contents, and there is no provision to distinguish one string of characters in a tag from another. Therefore, XML itself doesn't allow a tag to contain a text phrase in Chinese and another text phrase in Japanese. HTML doesn't have this problem because any HTML structure can contain a <span lang=".."> ... </span> structure. If someone is designing an XML schema and requires the ability for a tag to mix different types of ideographs, then the schema must permit some subsidiary tag that can delimit spans of national characters.

Localization and Internationalization

Computers are configured to operate in the local language of the country in which they are installed. Keyboards not only support national characters and accents, but the layout of keys on the keyboard varies from country to country ("QWERTY" in the US, "AZERTY" in France, "QWERTZ" in Germany). Utilities expect the local character set, and files are probably written to disk in the same national code used ten years ago.

There may still be some equipment that substitutes local characters for the "national use" characters of ASCII. However, the rapid replacement of computers and printers has probably migrated most use outside of the Far East to one of the eight bit character sets. In the Far East, the local character set will be one of the traditional multi-byte character sets (in Japan, for example, either EUC or Shift-JIS).

Most foreign language files are exchanged with other people in the same country or region. So if both the producer and consumer of the text default to the same character set and encoding, the question of international standards doesn't arise.

It would not have been necessary for the Web standards to move to Unicode just to display text from different countries on the same page. A variety of HTML constructs (IMG, IFRAME, OBJECT) allow a "Web page" to be composed from different sources each with its own format. The IMG, for example, references an external image file in GIF or JPEG format. If it was necessary for the Browser to combine data from incompatible character sets, each block of text could be transmitted from the Web Server with its own Content-Type header with its own "charset" designation.

However, in the HTML 4 world of dynamic content manipulated through the CSS and the DOM, it would be almost impossible for a Browser to manage content unless it had all been reduced internally to a common character set. Given the state of modern technology, the Browser has to be programmed to use Unicode.

If the Browser had to use Unicode internally, then the HTML and XML standards might just as well accept Unicode as the reference character set in which each markup language standard is defined. Now a single Web page can contain text from as many different languages using as many different character sets as the author might choose, without the requirement that each different language segment be isolated in its own file.

Language Standards and Character Sets

Standards for programming languages and the Web are defined in terms of abstract characters instead of codes. For example, the languages based on C (including C++, C#, Perl, Java, and JavaScript) all delimit a block of statements between the brace characters "{" and "}". In ASCII and all the other international standards, these characters are assigned the code values of 123 and 125. However, a programmer can create a C program on an IBM mainframe where the EBCDIC character set represents these two characters as 192 and 208.

The syntactic elements of every programming language or Web standard attach significance only to the characters in the ASCII subset (except for mainframe languages like PL/I that used a few IBM characters like "¬", now mapped to the second half of the Latin 1 set). In an HTML or XML file, the language elements (tags, comments, directives) begin when the "<" character is encountered in the stream and end with the ">" character. A browser, editor, or utility must process the stream character by character to determine its syntax and structure.

Text literals in the program and HTML or XML tag content can contain a much larger range of characters. Older languages supported "characters" only in the sense of an array of bytes. The program could store and process the bytes and remained indifferent to any particular one byte character encoding that might be associated with the data.

Modern languages (Visual Basic, Java, C# and the other .NET languages) store characters as an array of two byte units. They support character strings and literals in the full Unicode set. Similarly, the HTML 4 and XML standards are defined over Unicode as the base character set, so browsers and utilities that support these standards must internally process all text as Unicode.

What does it mean to say that HTML and XML use Unicode as their base character set? Tags still begin with the "<" character, and that character remains the same whether the file is coded in ASCII, 8859-1, Shift-JIS, or UTF-8. The same can be said for the other base semantic characters like ">", "=", "&", and the quote marks. HTML goes further to insist that the tag names and attributes be the familiar names in the Latin alphabet. Again, the "<body>" and "</body>" elements are the same in ASCII, 8859-6, or UTF-16, and English speakers will be happy to know that the "body" tag name remains the same even if the text is all French (take that, you Québécois nationalists).

However attribute values, comments, and the content of tags that generate text on the screen can be formed from any Unicode character. XML goes further by allowing tag and attribute names to be formed from national language characters.

There are a few cases where Unicode characters are stored on disk as an array of 16 bit units. The Windows NTFS file system, for example, stores file names and attributes as two byte Unicode characters. However, most file content is encoded in UTF-8 or converted to one of the eight bit codes before it is stored as file or database contents.

Newline and XML 1.1

When users entered programs or data into an early IBM mainframe, they punched each statement or item onto a separate punch card. The computer read the deck in one card at a time. When data was stored on disk, each card image was stored in a separate "record". A widely used mainframe file format precedes each variable size text record on disk with a two byte integer length field. In such a system the break between consecutive lines of text in a file is a physical, structural difference and is not denoted by any character in the character set.

To end one line of text on a Teletype machine and begin typing data on the next line, it was necessary to transmit two control characters. The "Carriage Return" (CR) character moved the typing element back to the beginning of the current line. The Linefeed (LF) character advanced the paper to the next line. The characters could be used separately. To underline a title, the Teletype could send the text characters, then send just CR to position back to the start of the line, and then type a sequence of underscore characters. "This is the end" followed by CR followed by "______________" produced "This is the end". A sequence of Linefeeds skipped blank lines.

The early US and International communications standards required the two character sequence CR and LF as a standard boundary between lines of text. This was a statement about how lines of text should be transmitted between machines and devices, but it did not impose any internal requirement on how data is stored on disk. Nevertheless, the Digital Equipment Corporation (DEC), the number 2 computer company during the mainframe era, designed its operating system to use CR and LF as a two character separator between lines of text. The first microprocessor operating system CPM was based on the DEC model, and DOS in turn was based on CPM.

It would have been intolerable to require people to type two keys on the keyboard to end each line. On most terminals the largest key was labeled "Return" and generated CR. Most computers would read CR as the end of line (from a human point of view) and logically add the missing LF if needed. However, it became obvious that two characters to separate each line was a waste of disk space. The Macintosh operating system adopted the user-centric convention and used CR to separate lines of text in files. Unix made the other choice and put LF between lines.

CR and LF each provided half of the new line function. None of the other 32 original Teletype control characters provided a reasonable substitute. IBM's EBCDIC system started as an eight bit code with 64 control characters. One control character corresponded to CR, one to LF, and IBM added a third character designated "Newline" that combined the two functions.  When ASCII was extended from seven to eight bits, it added an additional 32 control characters with code values from 128 to 159. The new control character with code value 133 is called NEL and corresponds to the combined newline function.

The extra control functions were mandated for every eight bit extension of ASCII. They are part of every one of the ISO 8859 family of eight bit codes. They are also embedded in Unicode/ISO 10646.

There is a strong bias in the computer community toward Unix conventions. People are certainly entitled to their personal preferences, but a Standard has to follow fixed rules. The line separation function is not necessarily a character. Some programming languages like C and Java require the file I/O support to reserve one code value from the character set that can be mapped to the language literal '\n'. This character is inserted between lines as they are fed to the programming language. However, this character may be used to replace a two character sequence on some operating systems and it may be manufactured and inserted on other systems where lines are a structural feature not delimited by characters.

When data is transmitted across a network between different operating systems, it is not possible to assume that any particular line separation protocol will be used. If the US, International, and early Internet standards were followed today then lines would be separated with the two character sequence CR and LF. However, Unix and C have such a strong influence on current programming that it is common for lines to be separated by LF.

The original XML 1.0 standard forgot the second set of 32 control characters. It excluded all control characters in the code value range 0-31 except for three: HT (Tab), CR, and LF. It then included all characters above Blank (code value 32), thereby including DEL and the second 32 control codes. The standard defined line separation to be alternately designated by CR, LF, or CR+LF and allowed any processing program to normalize line separation to its own internal favorite form.

The XML 1.1 revision corrects these errors. It excludes DEL and 31 of the extra control codes, allowing only NEL to appear as text in the document. It then revises the line separation definitions to include forms based on NEL as well as CR and LF.

In HTML a Newline is ignored except in a block of preformatted text. In other contexts a line break is generated by the <br /> tag, but physical lines in the source are ignored. XML, however, regards all Newlines inside a tag as content. This means, however, that if XML files are to be fully transportable, each recipient of XML data must be willing to recognize and process any of the generally recognized character sequences that denote a Newline on different systems. A middleware product is free to convert incoming Newline constructs into the format native to the middleware platform, because when the file is transmitted a similar burden is placed on the next system to recognize all legitimate forms of Newline as interchangeable.
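
As a rough illustration of the burden this places on a recipient, here is a sketch in Java (assuming the text has already been decoded from bytes into a String) that normalizes the line-separation forms mentioned above to a single LF.

    // Accept CR+LF, lone CR, lone LF, or NEL (U+0085) as a line boundary
    // and rewrite each as a single LF.
    static String normalizeNewlines(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                out.append('\n');
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                  // swallow the LF of a CR+LF pair
                }
            } else if (c == 0x0085) {     // NEL, the combined newline control
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }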

The W3C may make standards for the Web, but its authority derives from universal agreement and deference. It is subsidiary to the International Standards Organization and the formal standards making bodies that have an ISO mandate. W3C Standards, such as XML, must honor in every detail the character set standards like ISO 10646 on which they are built. Revision 1.1 brings XML into compliance.

Canonical Forms and Normalization

The use of diacritical ("accent") marks varies from language to language. Sometimes the mark changes the way that the character is pronounced. Sometimes it represents a tone shift. In some languages the "letters" are consonants and the "marks" are vowels. In some languages the diacritical marks are components that must be assembled together in order to form a meaningful character.

If the mark simply changes the pronunciation of the letter, as in most European languages, then the accented letter is plausibly a single character. Most modern character codes have a unique code value for characters like è (e with grave accent) and ç (c with cedilla). However, when a mark acts as a vowel then the mark really is a separate character even though it is displayed above or below some other character.

In the days before computers, European typewriters had a method of typing characters without advancing the print element to the next position. The previously typed character could then be overstruck with a second character producing a compound result. Thus the letter "e" and grave accent " ̀" would be two separate print elements that, struck together in the same location, displayed as è.

Unicode preserves the concept of "combining" characters, although their use is recommended for more exotic languages than French. Characters in a particular range of codes implicitly combine with the character before them to produce a compound display. This means, however, that there are two ways to display certain accented characters. They can be generated as a single character code, or as a combining character sequence.

There is also the problem of ligatures, where two letters combine to form a composite symbol. The difference between "œ" and "oe" may just be a typographical option.

Unicode defines certain equivalence tables. A processing program is free to translate incoming text to a canonical form before processing it. The standard requires that such programs remember and honor equivalences. This poses a problem because two strings may be of different length and yet contain equivalent characters.
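
The next paragraph mentions the .NET support for equivalences; as an illustration of the same idea in Java, here is a minimal sketch using the java.text.Normalizer class to reduce a precomposed character and its combining-character equivalent to one canonical form before comparing them.

    import java.text.Normalizer;

    public class CanonicalForms {
        public static void main(String[] args) {
            String precomposed = "\u00E8";   // e with grave accent as a single character
            String combining   = "e\u0300";  // letter e followed by a combining grave accent

            // The two strings display identically but are not equal code for code.
            System.out.println(precomposed.equals(combining));    // false

            // After normalization to a common canonical form (NFC), they compare equal.
            String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
            String b = Normalizer.normalize(combining, Normalizer.Form.NFC);
            System.out.println(a.equals(b));                       // true
        }
    }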

The Microsoft .NET framework and its languages like C# have rather elaborate support for equivalences. The default meaning of a test for "equals" between two strings takes the equivalence rules into consideration. For example, it will report that a string containing "ß" matches another string containing "ss" in the same relative position.

The "Layer" Problem

If the data arrives from a network server over the HTTP protocol, there is provision in the HTTP standard headers to declare a character set and encoding in the "charset" attribute of the Content-Type header. For example:

Content-Type: text/xml; charset=utf-8

This, however, assumes that the Web Server knows what encoding to declare. The problem here is that character set and encoding are not an attribute of even modern disk file systems. There is no version of Unix or Windows that will, in any standard way, disclose if a file contains ISO 8859-1 or ISO 8859-2 byte codes.

In the United States this mostly doesn't matter. Our text files generally contain only characters from the universal ASCII subset. Such data is correctly interpreted when treated as any of the ISO 8859 family (because the first half of every standard eight bit code is ASCII), or as UTF-8 (because the ASCII characters are encoded as single byte values in UTF-8). For that matter, they happen to display correctly if declared to be in Japanese EUC because, again, legacy multi-byte codes generally reserve bytes with values 0-127 to be plain ASCII text.

The simplest solution in foreign countries is to adopt one standard character encoding and store all files on the same machine in the same encoding format. Then the Web Server can be configured with a single Content-Type that applies to all files.

However, the HTML and XML standards have adopted a rather peculiar convention that needs to be discussed. They allow the encoding to be specified at the beginning of the data itself. Let me quote a section of the W3C note on Internationalization:

The document character set for XML and HTML 4.0 is Unicode (a.k.a. ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.

It is very important that the character encoding of any XML or (X)HTML document is clearly labeled. This can be done in the following ways:
  • Use the 'charset' parameter in the Content-Type header of HTTP. Example:
    Content-Type: text/html; charset=EUC-JP
  • For XML, use the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:
    <?xml version="1.0" encoding="iso-8859-1" ?>
     
  • For HTML, use the <meta> tag inside <head>. Example:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >

    For XHTML, you need a slash at the end:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

With this information, clients can easily map these encodings to Unicode. In practice, a few encodings will be preferred, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-8, UTF-16, the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.

The problem with this rule is that it is circular:

  1. To learn how to convert bytes to characters, you have to parse some tags
  2. To parse the tags you have to find significant characters like "<", ">", and "="
  3. To find the characters, you have to decode the bytes
  4. Go to 1

There is a trick that gets you out of the infinite loop, but it works only with careful planning. The <?xml ?> construct must come first in the file. The <meta> tags can come anywhere in the <head>. You have to ensure that the http-equiv <meta> tag comes first, because other <meta> tags could legally contain attribute values that use national language characters.

Don't begin HTML files with big comments. Such comments could easily contain national characters, and they cannot be parsed until the encoding is determined. If you want to comment the page, put the comment and anything else that could contain special characters after the <meta> tag.

If this rule is followed, then all the characters up to the encoding declaration are going to be ASCII. All of the encodings except UTF-16 store ASCII characters one byte per character. A file containing UTF-16 characters must begin with a special sequence of byte values (the Byte Order Mark or BOM) that distinguish it from eight bit character files.

If the file is transmitted over HTTP and contains a "charset" in the Content-Type header, use that value. If not, scan the first bytes of the data for a UTF-16 BOM pattern. If that is not found, then parse the beginning of the file in ASCII looking for the "<?xml ?>" or "<meta>" tags.

If none of these techniques yields a clear encoding option, a processor will have to apply a default. Ten years ago the recognized default was ISO 8859-1. Today the Internet is used throughout the world, and the modern standards tend to prefer UTF-8. However, rather than change the default, the current standards documents say explicitly that there is no officially recognized default. If no encoding has been specified, the document is in error. Of course, this means that most US Web content is technically in error, but that is not a problem because ASCII characters can be processed correctly by any choice of encoding. So the choice of default encoding is left up to the user.
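
Pulling these steps together, here is a simplified Java sketch of the detection order: HTTP charset first, then a UTF-16 Byte Order Mark, then an ASCII scan of the first bytes for a declaration, then a locally chosen default. The class and method names are made up for illustration, and a production processor would follow the detection rules in the standards more carefully.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CharsetSniffer {
        // head holds the first bytes of the document; headerCharset is whatever
        // arrived in the HTTP Content-Type header, or null if there was none.
        static Charset detect(byte[] head, String headerCharset) {
            if (headerCharset != null) {
                return Charset.forName(headerCharset);
            }
            // UTF-16 files begin with a Byte Order Mark: FE FF or FF FE.
            if (head.length >= 2) {
                int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
                if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
                if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
            }
            // Otherwise any declaration near the top is expressible in plain ASCII.
            String ascii = new String(head, StandardCharsets.US_ASCII);
            Matcher m = Pattern.compile("encoding=[\"']([A-Za-z0-9._-]+)[\"']").matcher(ascii);
            if (m.find()) return Charset.forName(m.group(1));    // <?xml ... encoding="..."?>
            m = Pattern.compile("charset=([A-Za-z0-9._-]+)", Pattern.CASE_INSENSITIVE).matcher(ascii);
            if (m.find()) return Charset.forName(m.group(1));    // <meta ... charset=... >
            // No declaration found; fall back to a default chosen by the application.
            return StandardCharsets.UTF_8;
        }
    }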

Communication programming is generally done in layers. The lowest layer forms bits and bytes. The highest layer parses the content of messages.

Even when reading data from a file, modern programming languages generally assign character encoding to the I/O library and the open file object. The program receives a sequence of characters that it can parse for content. Given the underlying buffering of data, many systems do not allow the character encoding option to be changed part way through the reading of the file.

If the file were on disk, it would be a simple matter to open it once with ASCII encoding, read part way through to get the embedded encoding information, close the file, and then reopen it with the specified encoding form. Reading the start of the file twice may be somewhat inefficient, but not terribly so.

However, if the data is being transmitted over a network, then closing and reopening the network connection is somewhere between a really bad idea and a complete disaster. The XML file you are reading may be the result of a database query that took minutes to execute, and you don't want to run the query again. You might read the data in as a stream of bytes, store it in memory, and then lay a dummy in-memory file over the stored byte array. This works for small files, but becomes progressively less attractive as the size of the file grows.
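
A sketch of that buffer-and-decode-twice approach in Java; the helper detectDeclaredCharset is a placeholder for logic like the sniffer shown earlier, and the 1024-byte prefix is an arbitrary assumption.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class ReDecode {
        static Reader openWithDeclaredEncoding(InputStream network) throws IOException {
            // Pull the entire response into memory so it never has to be fetched again.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = network.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            byte[] bytes = buffer.toByteArray();

            // First pass: treat the prefix as ASCII and look for an encoding declaration.
            String prefix = new String(bytes, 0, Math.min(bytes.length, 1024),
                    StandardCharsets.US_ASCII);
            Charset declared = detectDeclaredCharset(prefix);

            // Second pass: decode the same bytes with the declared encoding.
            return new InputStreamReader(new ByteArrayInputStream(bytes), declared);
        }

        static Charset detectDeclaredCharset(String asciiPrefix) {
            // Placeholder: parse <?xml encoding="..."?> or <meta ... charset=...> here.
            return StandardCharsets.UTF_8;
        }
    }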

In the most common network case, where the data is transmitted over the HTTP protocol, the problem is not new and has a solution. First, the HTTP protocol always starts with ASCII data (the HTTP request/response and headers) and then switches to some other data type and some other encoding in midstream. So modern software must be flexible enough to switch not just encoding but also MIME type in the middle of a stream. Also, the HTTP server can disclose the encoding of any text data in the Content-Type header itself. So network processing is only a serious problem when the file is fetched using FTP or one of the other protocols.

Another consequence of this convention is that a middleware product, like a portal, proxy, or gateway, cannot simply read in an HTML or XML file and then blindly forward the data on to a client. A single unedited file can be retransmitted, but only if handled as a byte stream throughout or if the characters are retransmitted in the same encoding that they were received. If it is possible that the characters will be encoded differently from the original data, then the middleware must parse the file to find the <meta> or <?xml ?> markup and change the charset or encoding value to match the properties of the new output stream.

There are solutions to the processing of encoded character data, but they are not clean or simple. The biggest part of the problem is that the standard and documentation don't clearly warn the programmer that the problem exists and has to be solved.

Conclusion

A human editor acquires information from various sources and manually combines them to form a final document. A Portal program acquires information from various sources and programmatically combines them to form a composite document. A Gateway acquires data in one format (say XML) and converts it to another format (say HTML). As long as the data is all produced and consumed within the US there is no problem. As the data moves across borders and languages, it becomes necessary to understand all aspects of the character handling problem.

Start by using a language that supports Unicode character strings and localization rules. Java or the Microsoft .NET Framework are good choices. To solve the widest range of problems, it is probably a good idea to adopt UTF-8 as the output character set. These choices allow the program to handle any type of character from any source and to generate all possible characters in the output. However, that is only part of the solution.

First, you have to get the data in. Data from different sources may be encoded in one of the 8859 eight bit character sets, one of the Windows vendor-specific character sets, or one of the legacy Far East multi-byte codes. The modern programming environment will convert any of these character sets to Unicode values that can be processed by the program. However, the program logic should not assume that the proper encoding will be declared by each source. It may be necessary to probe the data to determine the proper processing, and in some cases you may just have to know or guess an encoding that the source used without specifying.
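
As a minimal sketch of that pipeline in Java: decode the source in its own (known or guessed) encoding into Unicode, then write it back out as UTF-8. The file names and the Shift_JIS guess are illustrative assumptions.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.Writer;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Transcode {
        public static void main(String[] args) throws IOException {
            // "input.txt" and "output.txt" are placeholder names; the source
            // encoding is whatever was declared, probed, or guessed.
            Charset sourceCharset = Charset.forName("Shift_JIS");
            try (Reader in = new BufferedReader(new InputStreamReader(
                         new FileInputStream("input.txt"), sourceCharset));
                 Writer out = new BufferedWriter(new OutputStreamWriter(
                         new FileOutputStream("output.txt"), StandardCharsets.UTF_8))) {
                int c;
                while ((c = in.read()) != -1) {
                    out.write(c);     // characters are Unicode inside the program
                }
            }
        }
    }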

This leaves the two HTML attributes added for International support. The lang attribute must be added to the block of markup you generate to allow your Unicode Far East characters to be mapped into the correct choice among Chinese, Japanese, and Korean.

  1. If the input is HTML and there is a lang attribute on the current tag or any enclosing tag, propagate it to the output.
  2. If the input is XML and there is an xml:lang attribute on the current tag or any enclosing tag, then middleware that transforms the XML to HTML should generate a corresponding HTML lang attribute.
  3. If there is no applicable lang or xml:lang tag, but the incoming data has an encoding that strongly implies a language (like Shift-JIS), then middleware may generate the appropriate lang attribute.
  4. If the input is language neutral (UTF-8), and the XML or HTML provide no guide to the language selection, then an editor or middleware could apply a default language selection from external (meta) knowledge of the source of the data. If it comes from Tokyo, then lang="ja" looks like a good guess.

XML is not itself a presentation format, so it doesn't have attributes for "right to left" or alignment. HTML defaults to left to right presentation, and this behavior must be overridden for the correct display of Hebrew or Arabic text.

Now because the HTML has a mixture of Latin and Hebrew/Arabic text, the source is bidirectional. However, if the only Latin content is the markup, then the text displayed on the screen would be entirely Hebrew or Arabic. At that point, the only thing needed for correct display is alignment of the text with the right boundary. That can be accomplished by several different constructions:

  • dir="rtl"
  • align="right"
  • style="text-align:right"
  • style applied from an external CSS stylesheet

Unfortunately, all of these formatting directives could apply to an individual paragraph, or they could be inherited from any enclosing HTML tag all the way up to the body. Furthermore, although a specification of lang="he" (Hebrew language) is strongly suggestive that the text should be right aligned, it doesn't necessarily by itself generate the right presentation.

This creates an enormous complication for any automated process that gathers HTML information from foreign sources, processes it, and then redisplays it. Such middleware may have special difficulties parsing all the ways that text may be identified as RTL in a Web page. One cannot count on Web page authors doing what is currently regarded as best practice to code lang="he" and dir="rtl" in the appropriate tags. Middleware that ignores input stylesheets (intending to replace them with a different presentation style) could have a particular problem when the only appearance of text-align:right is in the stylesheet.

It may be impossible to design a program that can do the right thing with all data from all sources. However, most applications process the same data from the same sources over and over again. With an understanding of the problem, a little custom configuration, and some testing after each new source is introduced, it should be possible to develop a solution to any real-world requirement.
