When Bad Things Happen to Good Characters
Get to Know a Character
It can be useful to know your characters, but more practically useful to know one character well.
My character is an "e" with an acute accent, character code 233 (decimal) in Latin-1 and Unicode.
There are many ways it can be inserted into a document:
- On Windows, I hold down the Alt key, type 0233 on the numeric keypad, and release the Alt key.
I could use the charmap program, too.
Or I could copy and paste it.
But entering the character directly is risky because, if the character encoding changes,
e.g., from Latin-1 to UTF-8,
then the meaning of the stored byte 233 changes.
- In an HTML document, I can enter these magical incantations,
which are displayed correctly regardless of encoding:
Note: HTML/XHTML validation programs might not be acquainted with these and complain.
- &#233; (decimal) ⇒ é
- &#xE9; (hex) ⇒ é
- &eacute; (mnemonic) ⇒ é
- In Microsoft Word, I type an accent code followed by the accented letter.
On Windows, Ctrl+quote, then 'e'. On Mac, Option+quote, then 'e'.
Accent codes include: grave=backquote, acute=quote, circumflex=hat, umlaut=colon, cedilla=comma, tilde=tilde,
slash=slash, and perhaps others.
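The difference between a raw character code and the HTML references in the list above can be checked with a short sketch. Python is used here only for illustration (the original page contains no code), and the variable names are made up for this example.

```python
import html

# All three HTML references name the same character, whatever the page encoding.
references = ["&#233;", "&#xE9;", "&eacute;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)  # ['é', 'é', 'é']

# A raw byte 233 (0xE9), by contrast, only means é when read as Latin-1.
raw = bytes([233])
as_latin1 = raw.decode("latin-1")
print(as_latin1)  # é

# Read as UTF-8, the same lone byte is not a valid character at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError as err:
    valid_utf8 = False
    print("not valid UTF-8:", err.reason)
```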
What Could Possibly Go Wrong?
If é is UTF-8 encoded, but displayed without decoding, it looks like this: Ã©
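This effect is easy to reproduce. The following minimal Python sketch (an illustration, not part of the original page) encodes é as UTF-8 and then reads the two resulting bytes as if each were a Latin-1 character.

```python
utf8_bytes = "é".encode("utf-8")        # two bytes: C3 A9
misread = utf8_bytes.decode("latin-1")  # each byte shown as its own character
print(utf8_bytes.hex())  # c3a9
print(misread)           # Ã©
```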
The first 128 characters in the Latin-1 character set (same as ASCII)
are simply represented as themselves in UTF-8.
The second half of the Latin-1 characters (the non-ASCII ones) is split in two.
The first half of the non-ASCII Latin-1 characters (codes 128 to 191) are represented by themselves, preceded by code 194 decimal
or C2 hex, so the UTF-8 encoding for character code 191 (decimal), ¿, is C2 BF, which displays as Â¿ when not decoded.
The second half of the non-ASCII Latin-1 characters (codes 192 to 255) are represented by a different character,
the one 64 codes lower, preceded by code 195 decimal or C3 hex.
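The rule above is simple enough to apply by hand. As a sketch, the hypothetical helper `latin1_to_utf8_by_hand` below (a name invented for this example) implements it in Python and compares the result with Python's own UTF-8 encoder for every Latin-1 code. The "64 codes lower" step is what the rule's "different character" works out to.

```python
def latin1_to_utf8_by_hand(code):
    """Apply the rule to one Latin-1 character code (0-255)."""
    if code < 128:
        return bytes([code])         # ASCII: represented as itself
    if code < 192:
        return bytes([0xC2, code])   # 128-191: itself, preceded by C2
    return bytes([0xC3, code - 64])  # 192-255: 64 codes lower, preceded by C3

# Compare the hand-made rule with the built-in encoder for all 256 codes.
rule_matches = all(
    latin1_to_utf8_by_hand(c) == chr(c).encode("utf-8") for c in range(256)
)
print(rule_matches)                       # True

print(latin1_to_utf8_by_hand(191).hex())  # c2bf  (¿)
print(latin1_to_utf8_by_hand(233).hex())  # c3a9  (é)
```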
So, when looking at UTF-8 encodings of Latin-1 characters,
if you see Â or Ã where you do not expect it,
the text has probably been UTF-8 encoded too many times.
Multiple extra encodings have a pattern to them: each extra pass adds more Ã and Â characters, and by the fifth you get the idea.
Note: If you see boxes in the characters above,
it is because the font used is missing that character.
The only fix is to install or switch to a font that has it.
Fonts in alert boxes, title bars, and status bars may differ from those used elsewhere,
so the "alert", "title", and "status" buttons in the
Character Conversion Corner
can be used to test characters in those contexts.
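To see how repeated extra encodings grow, here is a small Python sketch. The helper `over_encode` is a hypothetical name for this example, and it assumes the bytes are misread as Latin-1 each time (a Windows-1252 viewer would show some of the same bytes differently).

```python
def over_encode(text, times):
    """UTF-8 encode `text`, misread the bytes as Latin-1, and repeat."""
    for _ in range(times):
        text = text.encode("utf-8").decode("latin-1")
    return text

lengths = []
for n in range(5):
    garbled = over_encode("é", n)
    lengths.append(len(garbled))
    # ascii() escapes non-ASCII characters, so invisible control characters show up too
    print(n, len(garbled), ascii(garbled))

print(lengths)  # [1, 2, 4, 8, 16]
```

Every extra pass doubles the length, because each misread character lies outside ASCII and so becomes two bytes on the next encoding.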
Too few encodings can have a bad effect that looks different.
When é is not UTF-8 encoded, it can appear as this very high-numbered character: �
Progressive under-encoding can result in a question mark being displayed.
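Both stages of under-encoding can be imitated in Python. This is a sketch under the assumption that the software involved substitutes a replacement character rather than failing outright, which is what the `errors="replace"` option asks for.

```python
latin1_bytes = "é".encode("latin-1")  # b'\xe9', never UTF-8 encoded

# Decoding as UTF-8 finds an invalid byte and substitutes U+FFFD.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(ord(shown))  # 65533

# Converting again to an encoding that lacks U+FFFD degrades it to a question mark.
degraded = shown.encode("latin-1", errors="replace")
print(degraded)    # b'?'
```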
You are now ready to diagnose UTF-8 encoding problems (e.g., with é):
|é || no problems
|Ã© || too much UTF-8 encoding, or viewing UTF-8 encoded text with Latin-1 encoding
|ÃÂ© || much too much UTF-8 encoding
|� || too little UTF-8 encoding
|? || something bad happened to this character
| || wild animals have eaten this character
|𐀓 || if you see a box, the font in use is missing this character.
Firefox 3's boxes contain the hexadecimal value for the missing character,
but it's still just a missing character.
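The "too much UTF-8 encoding" rows of the table can often be reversed. As a rough sketch, the hypothetical function `undo_extra_utf8` below peels off one layer at a time by re-reading the text as Latin-1 bytes and decoding those as UTF-8. It assumes the text was misread as Latin-1 specifically, and it cannot recover characters that have already become � or ?.

```python
def undo_extra_utf8(text, max_passes=5):
    """Remove extra UTF-8 layers; return (repaired_text, layers_removed)."""
    passes = 0
    while passes < max_passes:
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not over-encoded (any more), so stop here
        if candidate == text:
            break  # pure ASCII: nothing to undo
        text = candidate
        passes += 1
    return text, passes

print(undo_extra_utf8("é"))   # ('é', 0)
print(undo_extra_utf8("Ã©"))  # ('é', 1)
```

The loop stops at the first failed decode, so correctly encoded text is returned unchanged with a layer count of zero.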
Background Information from Wikipedia