ASCII, Unicode, and UTF-8
A computer does not know what the letter "A" is. To store text, we must assign a number to every character we want to use. This mapping system is called a Character Encoding.
The Early Days: ASCII
In the 1960s, the American Standard Code for Information Interchange (ASCII) was created. It was a 7-bit code, meaning it could represent 128 characters ($2^7$).
- 0-31: Control codes (like "Enter" or "Tab").
- 32-126: Printable characters (space, A-Z, a-z, 0-9, punctuation).
- 127: One last control code (Delete).
For example, in ASCII:
- A = 65
- a = 97
- ? = 63
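These values are easy to check yourself. The sketch below uses Python's built-in `ord()` and `chr()`; the helper name `ascii_code` is made up for this illustration and is not part of any library.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII number for one character (hypothetical helper)."""
    code = ord(ch)  # ord() gives the number assigned to the character
    if code > 127:
        # ASCII is a 7-bit code, so anything above 127 is outside it
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


for ch in ["A", "a", "?"]:
    print(ch, "=", ascii_code(ch))

# chr() goes the other way: number -> character
print(chr(65))
```

Passing a character such as "é" raises `ValueError`, which is exactly the limitation described next.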
The Limitation
ASCII was great for English speakers in the USA. It was terrible for everyone else. It had no accents (é, ñ, ü), no Greek letters, and certainly no Chinese or Japanese characters.
The Chaos: "Extended ASCII"
To fix this, different countries invented their own 8-bit systems. They used the "extra" 128 slots available in an 8-bit byte (128-255) to store their local characters.
- In Western Europe, byte 200 might mean È.
- In Russia, byte 200 might mean И.
If you opened a text file sent from Russia on a French computer, the text would look like garbage (called mojibake). There was no single standard.
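A small sketch can reproduce the problem. It assumes two specific code pages that the text above does not name: Latin-1 for the Western European system and Windows-1251 (`cp1251`) for the Russian one. Other regional encodings would give different garbage.

```python
# The same single byte, read under two different 8-bit encodings.
raw = bytes([200])
western = raw.decode("latin-1")  # assumed Western European code page
russian = raw.decode("cp1251")   # assumed Russian code page
print(western, russian)          # È И

# Mojibake: text saved with one encoding and opened with another.
original = "Привет"
saved_bytes = original.encode("cp1251")   # written on the Russian computer
garbled = saved_bytes.decode("latin-1")   # opened on the French computer
print(garbled)                            # Ïðèâåò

# The bytes themselves are intact; reading them with the right
# encoding recovers the text.
recovered = garbled.encode("latin-1").decode("cp1251")
print(recovered)
```

Nothing in the file was damaged. The bytes are the same in both cases, and only the lookup table used to read them changed.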
The Solution: Unicode
Unicode is not an encoding; it is a giant list (a map). The goal of Unicode is to assign a unique number (Code Point) to every character in every human language, plus symbols and emojis.
- A is U+0041
- 💩 (Pile of Poo) is U+1F4A9
Unicode defines what the number is, but not how to save it to a file. That is where UTF-8 comes in.
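To see that a code point is just a number, here is a minimal sketch in Python, where `ord()` returns the code point of a character. The helper name `code_point_label` is invented for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX style."""
    # At least four hex digits, upper case, as Unicode charts write it
    return f"U+{ord(ch):04X}"


for ch in ["A", "💩"]:
    print(ch, "is", code_point_label(ch))

# Going from the number back to the character
print(chr(0x1F4A9))
```

Notice that no bytes are involved yet. The code point says which character is meant, not how it is stored.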
UTF-8: The King of Encodings
UTF-8 is a way to save Unicode numbers as binary bytes. It is variable-length: each code point is stored in 1 to 4 bytes, depending on how large its number is.
- Backward Compatible: If a character is standard English (ASCII), UTF-8 uses just 1 byte. This means old ASCII files are also valid UTF-8 files.
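- Efficient: Characters with low code points use less space. ASCII takes 1 byte, and accented Latin, Greek, and Cyrillic letters take 2.
- Expansive: Most Chinese, Japanese, and Korean characters use 3 bytes, and rarer characters (like most emoji or ancient hieroglyphs) use 4 bytes.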
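The properties above can be checked with a short sketch. The sample characters and the helper name `utf8_length` are choices made for this illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))


samples = ["A", "é", "И", "中", "💩"]
for ch in samples:
    encoded = ch.encode("utf-8")
    # hex(" ") prints the bytes separated by spaces (Python 3.8+)
    print(ch, utf8_length(ch), "byte(s):", encoded.hex(" "))

# Backward compatibility: pure ASCII bytes decode identically either way.
ascii_bytes = b"Hello"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same_text)
```

The byte count grows from 1 for "A" to 4 for the emoji, while plain English text stays as compact as it was in ASCII.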
Today, over 98% of the web uses UTF-8. It is the default for almost every modern programming language, database, and text editor.
Common Mistake: If you see "weird characters" like ï»¿ at the start of a file, that is a BOM (Byte Order Mark) being read with the wrong encoding. It's a hidden signal (the bytes EF BB BF) that some text editors add to say "This is UTF-8." Modern tools usually don't need it, and it can break scripts.
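If a BOM is causing trouble, one way to detect and remove it is sketched below. It relies on Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec. The function name `strip_utf8_bom` and the sample script content are assumptions made for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A made-up script file that an editor saved with a BOM in front.
raw = codecs.BOM_UTF8 + "#!/bin/sh\necho hi\n".encode("utf-8")

# Read with a Western European code page, the BOM shows up as junk.
print(raw.decode("cp1252")[:3])

# Plain "utf-8" keeps the BOM as an invisible U+FEFF character,
# while "utf-8-sig" drops it when decoding.
with_bom = raw.decode("utf-8")
without_bom = raw.decode("utf-8-sig")
print(with_bom.startswith("\ufeff"), without_bom.startswith("#!"))

cleaned = strip_utf8_bom(raw)
```

This is why a BOM can break scripts: the file no longer begins with `#!`, so a tool that inspects the first bytes does not find what it expects.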