Copyright © 2003 jsd
The objective is to be able to publish documents on the web and have them look right to as many readers as possible. This is not so easy, because the various browsers are mutually incompatible.
If we confine ourselves to a certain set of “basic” characters (including the Latin alphabet, decimal digits, and basic punctuation), then there is pretty much a consensus on how a computer should represent them for purposes of computation, transmission, and storage.
But often there are reasons to include a larger cast of characters in our documents. We want to do this in a way that works correctly on all the major browsers,1 and alas it does not suffice to make something that looks good just on the machine where it was generated.
The logical way to proceed is by a two-step process. The first step is to figure out what characters (also called glyphs) you want to deal with. At this level of abstraction, the glyphs are identified by (and only by) their names. The second step is to choose an encoding scheme for the glyphs, and of course the matching decoding scheme whereby the text in your document is decoded into a displayable pattern of glyphs.
The PostScript language implements the aforementioned two-step process quite explicitly. The setup process (1) loads a set of named glyphs and (2) creates a lookup table that defines how the “show” command maps bytes to glyphs.
Adobe, the originator of PostScript, is also a major font vendor, so they play a big role in deciding what names go with what glyphs.
Since we are primarily discussing web documents we won’t go into details about PostScript.
The STIX consortium is busy making and collecting glyphs, and assigning names to them. See reference 1.
HTML of course supports the plain Latin alphabet, digits, and basic punctuation. In addition, the HTML powers-that-be have selected some additional entities, including some accented characters and some mathematical operators. These are “officially” part of HTML, so any browser that can render these symbols really ought to do so.
For example “lambda” is the name of the Greek letter λ, while “prop” is the name of the mathematical proportional-to operator.
As for the second step, HTML defines several different high-level encodings for these entities. The simplest high-level encoding refers to the characters by name: If you utter the character-sequence &prop; in an HTML document, the glyph for the proportional-to operator will be produced.
In addition, HTML assigns numbers to all the glyphs. You can use these numbers to form high-level encodings, using either decimal or hexadecimal numerals. For example, lambda can be represented by &#955; (decimal) or equivalently by &#x3BB; (hex).
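As a quick check of the three equivalent encodings just described, here is a small Python sketch (Python is not part of the document's toolchain; its `html` module happens to implement the same entity tables):

```python
import html

# The named, decimal, and hexadecimal entity encodings all
# decode to the same glyph -- here, Greek lambda (U+03BB).
named   = html.unescape("&lambda;")
decimal = html.unescape("&#955;")
hexa    = html.unescape("&#x3BB;")
print(named, decimal, hexa)  # all three print the same character
```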
Perhaps paradoxically, the most-common characters such as “A” and “!” are not assigned names by the HTML spec. You can refer to them by numeric encoding, but not by name. For example, “A” can be represented by &#65; or by &#x41; but not by &A; or anything like that. Of course it can also be represented at the low level by the bare character.
The situation is just the reverse for a couple of special symbols, namely ampersand and less-than. If you want to display the glyph for the less-than symbol, you must refer to it by one of the high-level encodings &lt; or &#60; or &#x3C; since the bare character “<” is reserved by HTML at the syntactic level.
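The reserved characters are exactly what escaping routines exist for; a minimal Python sketch (again using the standard `html` module, which is an assumption of convenience, not part of the original text):

```python
import html

# Bare "<" and "&" are reserved at the syntactic level, so they
# must be escaped before being placed in HTML text.
print(html.escape("if a < b && c"))  # if a &lt; b &amp;&amp; c
```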
It is convenient that the number assigned to each and every HTML entity (such as &#38; for ampersand) agrees with the Unicode numbering (section 3.3). The HTML entities are a small subset of the Unicode entities.
Similarly, the ISO-8859-1 characters are a subset of the HTML entities, and are numbered the same.
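This agreement between the three numbering schemes can be verified directly; a Python sketch (the choice of “é” is just an illustrative character from the Latin-1 range):

```python
# The HTML entity number, the ISO-8859-1 code, and the Unicode
# code point for a character in the Latin-1 range all agree.
c = "é"  # HTML entity &eacute; i.e. &#233;
print(ord(c))                     # Unicode code point
print(c.encode("iso-8859-1")[0])  # ISO-8859-1 byte value -- the same number
```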
In this section we discuss low-level encodings, that is, things that happen below the HTML level. A document encoding is also known as the document’s charset.
The venerable 7-bit ASCII encoding is extremely popular. It contains the basic Latin alphabet, digits, and basic punctuation. It shows up as the first half (i.e. the first 128 codes) of all the ISO-8859-* encodings, and many other encodings as well.
Also note that the HTML high-level encoding (such as &#65; for “A”) uses this same venerable numbering scheme.
Far and away the majority of web documents use the ISO-8859-1 encoding, which ISO calls the “Western” encoding. Its first half reproduces the 7-bit ASCII encoding. The second half contains some Latin characters with diacritical marks, some currency symbols, and a bit more punctuation. It is reasonably useful for the major languages found on the western edge of Eurasia, but is not complete even in that context. This encoding uses eight bits per character.
There are many other ISO encodings. For example, there is ISO-8859-5. Again its first half is 7-bit ASCII. The second half contains the Cyrillic letters. Unlike ISO-8859-1, it cannot encode the Latin letters with diacritical marks. It had to get rid of them to make room for the Cyrillic letters without using more than 256 codes.
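The consequence is that one and the same byte means different things under different 8-bit encodings; only the shared 7-bit ASCII half agrees. A Python sketch (byte value 0xE9 chosen for illustration):

```python
# The same byte value means different characters under different
# 8-bit encodings; only the first 128 codes (7-bit ASCII) agree.
b = bytes([0xE9])
print(b.decode("iso-8859-1"))  # é -- Latin small e with acute
print(b.decode("iso-8859-5"))  # щ -- Cyrillic small letter shcha
```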
There are obvious computational advantages to an encoding where each character fits into eight bits. But nowadays the trend is to move away from that.
Encodings that are specialized for one language are usually not very good for other languages. This leads us to consider unicode. It can represent pretty much everything, including Chinese, Cyrillic, math symbols, and more.
Let’s analyze how Unicode fits into the two-step scheme.
(1) As far as I can tell, Unicode does not provide the glyphs nor assign names to glyphs. It merely describes them, in terms like “left-pointing angle bracket”. If you want the glyphs and/or glyph-names, you must find them elsewhere.
(2a) Unicode defines a numbering scheme for the glyphs. Every glyph you can think of has a number. This is pretty much the de-facto standard for numbering things.
(2b) Unicode defines various encoding schemes which encode the aforementioned numbers as bytes in a document. And conversely there are corresponding ways to decode the bytes of your document to produce the desired glyphs.
The most common encoding scheme is called utf-8. It has the interesting property of being a variable-length encoding. The basic Latin characters can be represented in a single 8-bit byte. Greek, Cyrillic, and Chinese characters can be represented in two bytes. Most math symbols can be represented in three bytes.
The principal alternative to utf-8 is utf-16. It represents each and every glyph in two bytes, which is no problem since fewer than 65,000 glyphs have been defined. For instance, the utf-16 code for “A” is just a 16-bit word representing the integer 65. There are two variants of utf-16, depending on which byte of the 16-bit word comes first in your document.
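The two byte-order variants can be seen side by side in a short Python sketch:

```python
# utf-16 represents "A" (number 65) as one 16-bit word; the two
# variants differ only in which byte of the word comes first.
print("A".encode("utf-16-le"))  # b'A\x00'  little-endian
print("A".encode("utf-16-be"))  # b'\x00A'  big-endian
```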
Unicode already exists, and is moderately well supported by the major browsers and operating systems. You can write and publish documents using unicode today, and they will pretty much just work, so long as you stick to the major alphabets of the world, and stick to the most-common math operators. Limitations will be discussed in section 5.6.
A font is a complete collection of type (or glyphs) in a consistent size and style.
For years, Microsoft has had something they call the “symbol font” … but it’s not a font. It’s an encoding; it’s a codepage.
The conjecture is that they use the same mechanism for changing fonts and for changing encodings, and that the “symbol font” was named by somebody too clueless to know the difference between a font and a codepage.
An 8-bit codepage can address at most 256 glyphs. Any normal font contains far, far more than 256 glyphs.
Choosing a font means choosing a style. People don’t switch to the so-called “symbol font” because they want to change the style. Rather they switch because they want to change the encoding.
To say the same thing in different words: If you wish to make a collection of 256 glyphs and point to them using an 8-bit codepage, that’s fine … but please don’t call it a font. Such a collection is an encoding issue, not a font issue.
Some nasty vestiges of the font-versus-encoding confusion still remain. For instance, MSIE (version 6 and below) can represent math operators (such as proportional-to), but does not reliably recognize the unicode or HTML-entity representations. The only way to make sure such symbols look OK under MSIE is to switch to the so-called “symbol font”. That is, you are forced to consider the proportional-to character to be some non-mathematical character, written in a “style” that makes it “look like” a proportional-to symbol. Obviously such kludges wreak havoc on search utilities and other things that want to pay attention to the meaning of the characters without regard to “style”.
Similar remarks apply to many other characters, as you may be able to see from the bottom half of the table in reference 2. Characters encoded in the symbol font are shown in red, side-by-side with the corresponding HTML entity, shown in black. In theory these should be the same (in any browser that implements the symbol font at all), but if you view that table using MSIE you will discover that MSIE does not reliably render the HTML code correctly.
MSIE knows how to render them if they are represented using the symbol font kludge, but does not reliably render the corresponding HTML entity codes. Apparently under some rare circumstances MSIE does recognize the entity codes, but experts are unable to explain when it works and when it doesn’t.
The “symbol-font” situation creates terrible interoperability problems, since standards-compliant browsers, quite understandably, don’t process the font-switching directives the way MSIE does. So if you want to write a document that can be displayed on a range of browsers, you can’t just use the MSIE representation and you can’t just use any standard representation. You must resort to a bizarrely complicated workaround. Specifically, here is the simplest quasi-portable representation I know for the proportional-to operator:
<!--[if gte IE 3]><font face=Symbol>µ</font><![endif] --><![if lt IE 3]>&prop;<![endif]>
As demonstrated in reference 3, by using this workaround you can produce documents that look OK on both MSIE and Mozilla. This covers most but not all of the official HTML entities. Exceptions will be discussed in section 5.
Naive logic suggests that when a browser wants to display a character, it should find the encoding of that character and then look up that codepoint in the “current” font.
That leaves us with the question of what to do if the “current” font doesn’t have a usable glyph at this codepoint. Well, the Firefox browser does something clever: It looks through some list of fonts to see if there is some other font that does have a usable glyph at this codepoint, and if so, it uses that.
This can be confusing, because it means that if you see a glyph on your screen, you don’t know whether it came from your “current” font … and indeed you have no way (as far as I know) to tell what font it did come from.
Don’t blame the browser for problems inherent in the font itself. For instance, the Times New Roman font from Abisource doesn’t contain the glyphs for N-ary Summation or N-ary Product, so if you choose such a font some browsers will have no way of displaying the desired glyph. If you switch to the Times font from Adobe the problem goes away.
There is however some tricky business here: If the “current” font does not provide a usable glyph at a given codepoint, Firefox (and its siblings) will search all fonts on the system (in who-knows-what order) until it finds a usable glyph at that codepoint. This increases the chance that the browser will be able to display something useful, but it means that you don’t necessarily know which font the displayed glyph came from.
MSIE does not render most of the entities classified as “NEW” in reference 4.
MSIE renders the HTML code &uarr; or &#8593; as an upward arrow, as it should. However, one would also expect an upward arrow in position 173 of the so-called symbol font ... but it’s not there. This is a pain because the symbol-font arrows are much better looking than the HTML arrows, and it would have been nice to generate them whenever possible.
There are weird context-dependent bugs in MSIE. Suppose you want to produce the following string of symbols:
It is symmetric; characters to the right of the product symbol should be identical to the corresponding characters on the left. But under MSIE, if you code the sequence in the obvious way, it will sometimes be rendered wrongly, as you can see by viewing this:
using MSIE. It may switch to the wrong font to render the product symbol, and then stay in the wrong font for some hard-to-predict length of time. This only happens under some circumstances, notably if the document uses utf-8 encoding and uses the default font selection (i.e. the document doesn’t specify its own fonts via a style sheet or whatever).
Opera appears to be remarkably well-behaved and standards-compliant.
Mozilla is mostly well-behaved and standards-compliant. One slight drawback is that it apparently coerces all right quotes and left quotes to be symmetrical vertical tick-marks. I can live with that, although it’s not maximally beautiful or elegant.
Netscape version 4.8 (and presumably earlier versions) imposes a horrible dilemma:
Furthermore, Netscape refuses to recognize many of the HTML entities by name when they appear in a Unicode-encoded document. You can work around this by re-encoding the named entities in their decimal or hex representation. Appalling, but true.
BTW note that in utf-8 documents, Netscape refuses to draw any form of em-dash or en-dash. Mostly I avoid using these entities anyway, because usually it is possible to get the desired effect by stringing hyphens together.
Also note that Netscape is very slow when processing utf-8 documents, much slower than other browsers.
I’ve given up on symbol compatibility for Netscape. It has horrible non-standard behavior in non-utf mode, and markedly different yet still horrible non-standard behavior in utf mode. I’m not going to lose sleep over this, because Netscape is a pretty old browser and only a few people are still using it.
If you are writing in one particular language, there is almost certainly an encoding optimized for you. For example, there are at least 20 different Cyrillic encodings, each using one byte per character, each optimizing certain details at the expense of others.
If you are writing in a non-alphabetic language such as Chinese, there are encodings optimized for that, too, using two bytes per character.
Things get much more interesting if you need to handle more than one language at once. Imagine for instance a browser that wants to show you a synopsis of each file in a directory, when the files are in many different languages. In such a situation unicode looks pretty attractive. Computer infrastructure (operating systems, editors, browsers, search engines) appears to be evolving towards relying on unicode.
If you are writing HTML documents that are mostly in one language, with occasional snippets of math or other languages, a reasonable option is to use whatever encoding is “normal” for your locale, supplemented by the HTML high-level encoding if/when there’s something your low-level encoding can’t handle. You must stick in the kludgey workarounds when necessary to deal with the MSIE issue.
If you find that no matter what 8-bit encoding you choose, a significant fraction (more than a few percent) of the characters in your document are outside that encoding, you have nothing to lose and everything to gain by switching to unicode.
For documents written mainly in English you can easily confine yourself to using the 7-bit ASCII low level encoding, and represent everything else using the HTML high-level encoding. Having done that, you can (in theory at least) make an arbitrary choice as to what you say is “the” low-level encoding of your document: you could call it utf-8, ISO-8859-1, ISO-8859-5, etc., and it shouldn’t make any difference.
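The strategy just described (7-bit ASCII at the low level, numeric entities for everything else) can even be automated; a Python sketch, using the standard `xmlcharrefreplace` error handler as a stand-in for whatever tool you actually use:

```python
# Encode a document body as pure 7-bit ASCII, replacing anything
# outside that range with HTML-decimal entities.
text = "proportional-to: ∝, lambda: λ"
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes.decode("ascii"))
# proportional-to: &#8733;, lambda: &#955;
```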
Sometimes the practice differs from the theory. For example, browsers such as Mozilla and Netscape allow (indeed require) the user to set certain preferences separately for each encoding. These preferences include things like default font-family and font-size. This is relevant because ISO-8859-1 is so exceedingly common that many users have no idea that other encodings exist, and have not set up suitable preferences for them. The result is that when they visit a non-ISO-8859-1 page, it looks “funny” but they don’t know why. So this is an argument against declaring your document to be anything other than ISO-8859-1 if/when it could just as well be ISO-8859-1.
For this reason alone, if you have a choice between ISO-8859-1 or something else, I suggest sticking with ISO-8859-1, just to avoid rocking the boat.
You can observe the encoding-dependent behavior of Netscape and/or Mozilla by comparing reference 5 against reference 2. Those two documents are identical except for their charset declarations, and they do not contain any bytes outside the 7-bit ASCII range, where utf-8 is byte-for-byte identical to ISO-8859-1.
The standard default encoding on the world-wide-web is ISO-8859-1.
If you decide you want some other encoding, there are two things you need to do. First, the HTML file ought to specify its content-encoding in its “head” section. For example, if you want the utf-8 encoding, you say:
<HEAD> ... <META http-equiv="Content-Type" content="text/html; charset=utf-8"> ... </HEAD>
Secondly, you should teach whatever server is serving that file to send the correct content-type in the HTTP header. Note that the HTTP header is conceptually different from the HTML head section. Also note that web servers do not generally read the HTML document to discover META directives. One correct approach, assuming you have an Apache server, is to put a line like this in the .htaccess file:
AddType text/html;charset=UTF-8 html htm
That tells the server that all .html and .htm files in the directory should be treated as utf-8. (If you work harder you can select which files to get which treatment.)
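One way to do the per-file selection is with a `<Files>` container around the charset directive; a hedged sketch (the filename is purely illustrative, and `AddCharset` is the mod_mime directive for declaring a charset without restating the media type):

```apache
# Hypothetical: declare utf-8 for just one file, leaving the
# directory-wide default alone. Filename is illustrative.
<Files "unicode-page.html">
AddCharset UTF-8 .html
</Files>
```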
Very weird things happen with some browsers if the encoding in the HTTP header is contradicted by a META directive in the HTML head.
The most widely-available HTML editor is word-for-windows (aka winword). It can read in an HTML file, edit it, and write it back out again. Alas it doesn’t do a very nice job. If it reads in a file with utf-8 encoding, it will write it back out in windows-1252 encoding. That destroys any hope of compatibility or portability, because there are lots of people who don’t run windows and the windows-1252 encoding means nothing to them.
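The kind of damage this re-encoding causes is easy to demonstrate; a Python sketch showing what a reader sees when utf-8 bytes are misinterpreted as windows-1252 (the familiar mojibake effect):

```python
# What goes wrong when utf-8 bytes are (mis)read as windows-1252:
# the two utf-8 bytes of the copyright symbol turn into two
# unrelated characters.
garbled = "©".encode("utf-8").decode("windows-1252")
print(garbled)  # Â©
```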
Winword also makes heavy use of the abominable symbol-font encoding. It does this even for symbols such as the copyright symbol for which widely-compatible simple representations are available. In fact winword uses 323 bytes to represent even ultra-simple things like the copyright symbol.
I write most of my documents in the LaTeX language. That means I can convert them to .pdf format using pdflatex, and/or convert them to HTML using HeVeA.
I have a postprocessor that I apply to the HTML generated by HeVeA. It catches such things as
... and converts them to the HTML-decimal encoding. It emits the workaround for MSIE gradient and proportional-to operators.
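The named-entity-to-decimal step of such a postprocessor can be sketched in a few lines of Python; the details of the real postprocessor are not shown in the text, so this is an illustrative reconstruction using the standard entity table:

```python
import html.entities
import re

# Rewrite named HTML entities into their HTML-decimal encoding,
# which even picky browsers accept. Unknown names and numeric
# references are passed through unchanged.
def names_to_decimal(text):
    def repl(m):
        cp = html.entities.name2codepoint.get(m.group(1))
        return "&#%d;" % cp if cp is not None else m.group(0)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(names_to_decimal("x &prop; y&sup2;"))  # x &#8733; y&#178;
```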
It is relatively easy to get a list of symbols; see e.g. reference 6.
What we need is a list that says what symbols are well handled by which browsers. Reference 2 and especially reference 7 are a start in that direction, but more needs to be done. It’s a real crock to start using a symbol and then discover it can’t be displayed on some systems.