Text & Encodings • 8 min read • Last updated: January 16, 2024

ASCII, Unicode, and UTF-8

A computer does not know what the letter "A" is. To store text, we must assign a number to every character we want to use. This mapping system is called a Character Encoding.

The Early Days: ASCII

In the 1960s, the American Standard Code for Information Interchange (ASCII) was created. It was a 7-bit code, meaning it could represent 128 characters ($2^7$).

0-31: Control codes (like Tab, or the line break sent by the "Enter" key).
32-126: Printable characters (space, A-Z, a-z, 0-9, punctuation).
127: One last control code (Delete).

For example, in ASCII:

A = 65
a = 97
? = 63
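
These values are easy to check for yourself. A minimal Python sketch, using the built-in ord and chr functions (the variable name codes is just for illustration):

```python
# ord() gives the number behind a character; chr() goes the other way.
codes = {ch: ord(ch) for ch in "Aa?"}
print(codes)      # {'A': 65, 'a': 97, '?': 63}
print(chr(65))    # A

# Every ASCII character fits in 7 bits, so its code is below 128.
print(all(code < 128 for code in codes.values()))  # True
```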

The Limitation

ASCII was great for English speakers in the USA. It was terrible for everyone else. It had no accents (é, ñ, ü), no Greek letters, and certainly no Chinese or Japanese characters.

The Chaos: "Extended ASCII"

To fix this, different countries invented their own 8-bit systems. They used the "extra" 128 slots available in an 8-bit byte (128-255) to store their local characters.

In Western Europe, byte 200 might mean È.
In Russia, byte 200 might mean И.

If you opened a text file sent from Russia on a French computer, the French computer would read the same bytes using its own table, and the text would look like garbage (called mojibake). There was no single standard.
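
The effect can be reproduced in a few lines of Python. This is only a sketch: the article does not name specific encodings, so Windows-1252 (Western European) and Windows-1251 (Cyrillic) are assumed here as representative examples, and the sample word is made up for the demonstration.

```python
# One byte with the value 200, read through two different 8-bit tables.
raw = bytes([200])
western = raw.decode("cp1252")    # assumed Western European table
cyrillic = raw.decode("cp1251")   # assumed Cyrillic table
print(western, cyrillic)          # È И

# A Russian word saved with the Cyrillic table, opened with the Western one.
original = "Привет"
garbled = original.encode("cp1251").decode("cp1252")
print(garbled)                    # Ïðèâåò

# The bytes were never damaged; reading them with the right table recovers the text.
recovered = garbled.encode("cp1252").decode("cp1251")
print(recovered)                  # Привет
```

Note that the file itself is intact in both cases. Only the interpretation of the bytes differs.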

The Solution: Unicode

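Unicode is not an encoding; it is a giant list (a map). The goal of Unicode is to assign a unique number (Code Point) to every character in every writing system, plus symbols and emoji. Code points are written as U+ followed by the number in hexadecimal, so the ASCII value 65 becomes U+0041.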

A is U+0041
💩 (Pile of Poo) is U+1F4A9
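
To look up code points yourself, here is a minimal Python sketch. The helper name code_point is hypothetical; ord and the standard unicodedata module do the actual work.

```python
import unicodedata

def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "💩"]:
    # Prints the character, its code point, and its official Unicode name.
    print(ch, code_point(ch), unicodedata.name(ch))
```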

Unicode defines what the number is, but not how to save it to a file. That is where UTF-8 comes in.

UTF-8: The King of Encodings

UTF-8 is an encoding: a rule for turning Unicode code points into bytes. It is variable-length, using between 1 and 4 bytes per character.

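Backward Compatible: If a character is standard English (ASCII), UTF-8 uses just 1 byte, with the same value as in ASCII. This means old ASCII files are also valid UTF-8 files.
Efficient: Lower code points get shorter encodings. Accented Latin letters, Greek, and Cyrillic use 2 bytes.
Expansive: Most Chinese, Japanese, and Korean characters use 3 bytes, and rarer characters (like emoji or ancient hieroglyphs) use 4.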
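
The variable-length behavior is easy to observe. In this Python sketch, the sample characters (other than A and 💩) and the helper name utf8_bytes are choices made for illustration only.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")

samples = ["A", "é", "И", "中", "💩"]
for ch in samples:
    encoded = utf8_bytes(ch)
    # Show the character, how many bytes it needs, and the bytes in hex.
    print(ch, len(encoded), encoded.hex(" "))

# ASCII text produces identical bytes under both encodings.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```

The five samples need 1, 2, 2, 3, and 4 bytes respectively.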

Today, over 98% of the web uses UTF-8. It is the default for almost every modern programming language, database, and text editor.

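Common Mistake: If you see "weird characters" like ï»¿ at the start of a file, that is a BOM (Byte Order Mark) being displayed by a program that is reading the file with an old Western European encoding instead of UTF-8. The BOM is three hidden bytes (EF BB BF) that some text editors add to say "This is UTF-8." Modern tools usually don't need it, and it can break scripts.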
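
If a BOM is causing trouble, it can be detected and removed. A minimal Python sketch, assuming the file contents are already in memory as bytes; the sample data and the helper name strip_bom are hypothetical, while codecs.BOM_UTF8 and the "utf-8-sig" codec come from the standard library.

```python
import codecs

# Hypothetical file contents: a shell script saved with a UTF-8 BOM in front.
data = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")

# Reading the first three bytes with the wrong table shows the familiar debris.
misread = data[:3].decode("cp1252")
print(misread)  # ï»¿

def strip_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# Option 1: strip the bytes, then decode as plain UTF-8.
text_a = strip_bom(data).decode("utf-8")

# Option 2: the "utf-8-sig" codec skips a BOM automatically when decoding.
text_b = data.decode("utf-8-sig")

print(text_a == text_b)  # True
```

When reading from disk, passing encoding="utf-8-sig" to open() has the same effect as the second option and also works on files that have no BOM.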