Improved ways of communicating arbitrary messages are called telegraphy. Telegraphy is the long-distance transmission of messages where the sender uses symbolic codes, known to the recipient, rather than a physical exchange of an object bearing the message.
Encoding is a way to convert data from one format to another.
A single unit of textual data is called a character (or char in most programming languages). For now, it's enough to know that a char can be any sign used for creating textual content, such as a letter of the English alphabet, a digit, or another symbol like a space, comma, exclamation mark, question mark, etc.
A character encoding is a way to convert text data into binary numbers.
Essentially, encoding is the process of assigning unique numeric values to specific characters and converting those numbers into binary form. These binary numbers can later be converted back (or decoded) to the original characters based on their values.
A character set is simply a mapping between numbers and characters. Simply put, a character set is an agreement that defines the correlation between numeric values and the characters they stand for.
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
0123456789
<space>!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
To represent 95 characters, we need at least 7 bits, which allows us to define 2^7 = 128 characters.
You have surely heard of and used many different fonts on a computer. It should be obvious by now that the character set only defines what is sent, not how it looks.
When we are referring to a character as “A” or “B”, we have a common understanding of what it means (it’s the Platonic “idea” of a particular character). But we can all read and write the same characters in different variations (shapes).
A font is a collection of glyph definitions, in other words, the shapes (or images) that are associated with the character they represent.
You can easily learn the ASCII codes for the characters A, a, and 0 because they are arranged in such an elegant way:

- A starts with 1, followed by all zeros and a 1 at the end (as the first letter in the alphabet): 100 0001
- a - just take A (100 0001) and flip the second bit to 1: 110 0001
Character | Binary representation | Rule |
---|---|---|
A | 100 0001 | Starts with 1, has all zeros, and 1 at the end |
a | 110 0001 | Starts with 11, has all zeros and 1 at the end |
0 | 011 0000 | Starts with 011, has all zeros |
Some examples are BS (backspace), DEL (delete), and TAB (horizontal tab).
CR - Carriage return
LF - Line feed
So, you would first push the handle on the left side (marked as 1 in the picture) to the right to move the carriage, hence CR (Carriage return). The next step is to "feed" the typewriter the next line of paper using the knob on the right end of the carriage (marked as 2 in the picture), hence LF (Line feed).
Operating system | End-of-line notation |
---|---|
Linux | LF (`\n`) |
Windows | CR LF (`\r\n`) |
MAC (up through version 9) | CR (`\r`) |
MAC OS X | LF (`\n`) |
In most programming languages, non-printable characters are represented using so-called escaped character notation, usually with a backslash character (`\`).
Character | Escaped |
---|---|
CR | `\r` |
LF | `\n` |
TAB | `\t` |
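To make this concrete, here is a minimal Python sketch (the string literals are my own examples) showing the escaped notation and how the different line endings behave:

```python
# A string containing a tab, a carriage return and a line feed
line = "first\tsecond\r\n"

print(line)        # the non-printable characters are invisible when printed
print(repr(line))  # 'first\tsecond\r\n' - repr() shows them in escaped notation

# splitlines() recognizes \n, \r\n and \r as end-of-line markers
print("one\r\ntwo\nthree\rfour".splitlines())  # ['one', 'two', 'three', 'four']
```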
Let's now try to encode the text Hello World! in ASCII.
Character | Dec | Hex | Binary |
---|---|---|---|
H | 72 | 48 | 100 1000 |
e | 101 | 65 | 110 0101 |
l | 108 | 6c | 110 1100 |
l | 108 | 6c | 110 1100 |
o | 111 | 6f | 110 1111 |
<space> | 32 | 20 | 010 0000 |
W | 87 | 57 | 101 0111 |
o | 111 | 6f | 110 1111 |
r | 114 | 72 | 111 0010 |
l | 108 | 6c | 110 1100 |
d | 100 | 64 | 110 0100 |
! | 33 | 21 | 010 0001 |
H e l l o <space> W o r l d !
01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00100001
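We can double-check this with a short Python sketch (assuming Python 3) that encodes the same text to ASCII and prints each byte in binary:

```python
text = "Hello World!"
encoded = text.encode("ascii")  # would raise UnicodeEncodeError for non-ASCII characters

# Print each character next to its 8-bit binary value
for ch, byte in zip(text, encoded):
    print(ch, format(byte, "08b"))

# The whole message as a sequence of bytes
print(" ".join(format(byte, "08b") for byte in encoded))
```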
Encodings that use the same mappings for the ASCII printable characters are called Extended ASCII and are commonly referred to as code pages. They guarantee that all ASCII-encoded files can be processed with these extended encodings.
One popular 8-bit encoding scheme that was not ASCII compatible is EBCDIC, which was used on IBM mainframes.
Instead of just one, we have a huge number of standardized code pages. A full list of code pages can be found on Wikipedia:
Windows-1252 (or ANSI-1252) - the default code page for legacy Windows OS
Code page 437 (also known as OEM 437, or DOS Latin US) - the original code page for the IBM PC.
ISO 8859-1, aka Latin-1 (also useful for Western European languages)
If not explicitly set, ISO 8859-1 was the default encoding for documents delivered via HTTP with a MIME type beginning with text/. Since HTML5, this has changed to UTF-8.
For most European languages this was acceptable, but not a great solution. You could not, for example, write in multiple languages in the same text file, because only one code page can be used while processing the text.
An even bigger issue was the lack of support for languages that have many more characters than the 128 extra slots available in the extended 8-bit ASCII code pages. Some examples are Arabic, Hindi, and Chinese (which has more than 10 thousand symbols called ideograms, which represent whole words rather than letters as we are used to in European languages).
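To see why this is a problem, the same byte means different things under different code pages. A small Python sketch (the codec names are Python's aliases for these code pages):

```python
data = b"\xe9"  # one byte, value 0xE9

print(data.decode("cp1252"))     # é - Western European code page
print(data.decode("iso8859-7"))  # a Greek letter - same byte, different code page
```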
In response to all the problems that code pages had introduced, a new standard was initiated, named Unicode. It was an attempt to make one huge character set for all spoken languages, even some made-up ones, and other signs such as emojis 🎉. The first version came out in 1991 and there have been many new versions since then, the latest one being from 2021 at the time of this writing. The Unicode Consortium also maintains the standard for UTF (Unicode Transformation Format) encodings. More on this later. But first…
In simple words, the letter A is just encoded with some binary number (0100 0001 in ASCII).
Diacritic - extra decoration for the character
The grammar of some languages requires the use of diacritics on letters when certain conditions are met. Therefore, the term character became ambiguous, so a new term was adopted for describing written symbols that may appear with or without diacritics - a grapheme.
A grapheme is a single unit of a human writing system. It may consist of one or more code points.
As there can be a huge number of combinations of letters and their possible modifiers, instead of making an encoding scheme for all of them, it's much more efficient to encode them separately. Therefore we break graphemes down into code points.
A code point is any written symbol or its modifier, such as a diacritic on a letter or even the skin tone of an emoji.
So, one or more code points can make up a grapheme. Therefore, the Unicode character set is defined as a set of code points rather than a set of graphemes.
Some people believe that Unicode is just a simple 16-bit code where each character is mapped to a 16-bit number, giving 2^16 = 65 536 possible code points. That is not the case: there are 144 697 defined characters at the time of writing this article.
What is true is that all characters that fit into 2 bytes, in other words 2^16 = 65 536 code points, are considered to make up the BMP - the Basic Multilingual Plane (U+0000 to U+FFFF). This was the first attempt to unite all the code points needed for storing textual data, but it soon became obvious that even more space was needed, so 2 bytes were no longer sufficient.
Let's go back to É for a moment. This particular symbol can be encoded in two ways: as a single precomposed code point (U+00C9), or as two code points - the plain letter E (U+0045) followed by the combining acute accent (U+0301).
U+XXYY

- U+ stands for Unicode
- XXYY are the bytes expressed as hexadecimal numbers (there can be two or more)
If we go back to É described as 2 code points, we can easily recognize the first code point, hexadecimal 45, which is indeed the letter E in the ASCII table.
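A quick way to see both forms is Python's unicodedata module (a small sketch, assuming Python 3):

```python
import unicodedata

precomposed = "\u00C9"                                  # É as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # E (U+0045) + combining acute accent (U+0301)

print([hex(ord(cp)) for cp in precomposed])  # ['0xc9']
print([hex(ord(cp)) for cp in decomposed])   # ['0x45', '0x301']
print(precomposed, decomposed)               # both render as É
```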
Some graphemes take more than 2 bytes. For example, the thumbs-up emoji 👍 has the notation U+1F44D.
Let's go back to our example from the beginning of this article with a simple change: let's add the emoji 🚀 between World and !:

Hello World🚀!
Character | Hex |
---|---|
H | 00 00 00 48 |
e | 00 00 00 65 |
l | 00 00 00 6c |
l | 00 00 00 6c |
o | 00 00 00 6f |
<space> | 00 00 00 20 |
W | 00 00 00 57 |
o | 00 00 00 6f |
r | 00 00 00 72 |
l | 00 00 00 6c |
d | 00 00 00 64 |
🚀 | 00 01 F6 80 |
! | 00 00 00 21 |
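Every character in the table above occupies exactly 4 bytes, which is what the UTF-32 (big-endian) encoding does. A minimal Python sketch (assuming Python 3.8+ for bytes.hex with a separator) reproduces the table:

```python
text = "Hello World🚀!"

# UTF-32 big-endian stores every code point in exactly 4 bytes
for ch in text:
    print(ch, ch.encode("utf-32-be").hex(" "))
```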
Remember BMP (Basic Multilingual Plane)? We said that all characters that fit into 2 bytes are considered to be part of BMP. UCS-2 was exactly that - 2 bytes per code-point and nothing more!
Code point | Binary value | ASCII | UCS-2 | UTF-32 |
---|---|---|---|---|
E | 01000101 | ✅ | ✅ | ✅ |
Φ | 00000011 10100110 | ❌ | ✅ | ✅ |
🚀 | 00000000 00000001 11110110 10000000 | ❌ | ❌ | ✅ |
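The same comparison in Python (UCS-2 has no codec of its own in Python; for BMP characters UTF-16 produces identical bytes, so it stands in here):

```python
for ch in ("E", "Φ", "🚀"):
    try:
        fits_ascii = len(ch.encode("ascii")) == 1
    except UnicodeEncodeError:
        fits_ascii = False
    print(ch, hex(ord(ch)),
          "ASCII:", fits_ascii,
          "UTF-16 bytes:", len(ch.encode("utf-16-be")),  # 2 for BMP, 4 for 🚀 (surrogate pair)
          "UTF-32 bytes:", len(ch.encode("utf-32-be")))  # always 4
```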
The mechanism UTF-16 uses is called surrogate pairs.
Let's take an emoji like 🚀, which has the Unicode value U+1F680.
00000000 00000001 11110110 10000000
Now, we see that it goes over 16 bits in size. To represent a character of more than 16 bits in size, we need to use a "surrogate pair", with which we get a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF, and the second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF.
High surrogate format: 110110XX XXXXXXXX
Low surrogate format: 110111XX XXXXXXXX
Now, we need to subtract 1 00000000 00000000 from the binary representation of the emoji.
00000000 00000001 11110110 10000000
- 00000000 00000001 00000000 00000000
= 00000000 00000000 11110110 10000000
0000 11110110 10000000
And replace the X signs in the high and low surrogates with the bits we just calculated:
0000 11110110 10000000
// Split in half:
00 00111101
10 10000000
// Replace Xs in High and Low surrogate
High surrogate mask: 110110XX XXXXXXXX
Low surrogate mask: 110111XX XXXXXXXX
High surrogate value: 11011000 00111101
Low surrogate value: 11011110 10000000
11011000 00111101 11011110 10000000
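The same arithmetic in a few lines of Python (the helper function name is my own; bytes.hex with a separator assumes Python 3.8+):

```python
def utf16_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    offset = code_point - 0x10000     # subtract 1 00000000 00000000
    high = 0xD800 | (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low

print([hex(v) for v in utf16_surrogate_pair(0x1F680)])  # ['0xd83d', '0xde80'] for 🚀
print("🚀".encode("utf-16-be").hex(" "))                # d8 3d de 80
```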
High surrogate unavailable range: 11011000 00000000 (D800) to 11011011 11111111 (DBFF)
Low surrogate unavailable range: 11011100 00000000 (DC00) to 11011111 11111111 (DFFF)
Surrogate unavailable range: 11011000 00000000 (D800) to 11011111 11111111 (DFFF)
In this range no characters are assigned; in other words, this hole makes 2^11 = 2048 code points unavailable.
As you can see, we successfully calculated the UTF-16 encoding, yeah 🥳!
But wait, you might ask, what are these 2 variations of UTF-16 called UTF-16BE and UTF-16LE 🤔? That brings us to the next topic…
The order in which the bytes of a multi-byte value are stored in memory is called the endianness of the computer system. Depending on where the MSB (Most Significant Byte) and the LSB (Least Significant Byte) are placed, there are:

- BE - Big-endian (the MSB is stored at the smallest memory address)
- LE - Little-endian (the LSB is stored at the smallest memory address)
A CPU usually does not process one byte at a time; it takes multiple bytes at once. This unit is called a word in CPU terminology. Naturally, the word size is a multiple of 8, since the smallest unit of storage is the byte (8 bits). Modern CPUs use 32-bit or 64-bit words.
Most modern computer systems (Intel processors for example) use a little-endian format to store the data in the memory. The reason is beyond the scope of this article, but it’s related to the internal structure of the CPU since particular endianness allows for certain features on different CPU designs.
If we have a 32-bit integer like 42, for example, we would write it in binary format as:
// Big Endian
MSB LSB
0000 0000 | 0000 0000 | 0000 0000 | 0010 1010 <- binary representation
0x00 | 0x01 | 0x02 | 0x03 <- memory address
// Little Endian
LSB MSB
0010 1010 | 0000 0000 | 0000 0000 | 0000 0000 <- binary representation
0x00 | 0x01 | 0x02 | 0x03 <- memory address
For our example, instead of reading 42, we could by mistake think it's 704 643 072!
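We can observe both byte orders directly with Python's struct module:

```python
import struct

n = 42

big_endian = struct.pack(">I", n)     # MSB stored first
little_endian = struct.pack("<I", n)  # LSB stored first

print(big_endian.hex(" "))     # 00 00 00 2a
print(little_endian.hex(" "))  # 2a 00 00 00

# Reading little-endian bytes as if they were big-endian gives the wrong number
print(struct.unpack(">I", little_endian)[0])  # 704643072
```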
All Unicode encodings use at least 2 bytes of data per code point, meaning that the CPU is storing multiple bytes at once in either BE or LE order.
A little trick we can use to make sure proper endianness is applied when reading files written in Unicode encodings is the so-called Byte Order Mark or BOM for short.
BOM is nothing more than a code point that has a special property.
Its Unicode value is U+FEFF and it represents a "zero-width, non-breaking space".
But the catch is that if we reverse the order of the bytes, we get U+FFFE, a value that is considered invalid in UTF-16 encoding. Hence, the text processor understands that it needs to read the bytes in the other order, using Little Endian.
Now, to demonstrate this, I will show how it looks when saving the characters abcde in a file using UTF-16 LE and UTF-16 BE:
// When saved as UTF-16 LE it will keep LSB on lowest memory address
FF FE 61 00 62 00 63 00 64 00 65 00
// When saved as UTF-16 BE it will keep MSB on lowest memory address
FE FF 00 61 00 62 00 63 00 64 00 65
If BOM is not set for UTF-16, it is assumed that Big endian is used.
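Python's built-in codecs produce exactly these byte sequences; note that the generic "utf-16" codec adds a BOM and uses the machine's native byte order (a sketch, assuming Python 3.8+ for bytes.hex with a separator):

```python
text = "abcde"

print(text.encode("utf-16-le").hex(" "))  # 61 00 62 00 63 00 64 00 65 00 (no BOM)
print(text.encode("utf-16-be").hex(" "))  # 00 61 00 62 00 63 00 64 00 65 (no BOM)
print(text.encode("utf-16").hex(" "))     # ff fe 61 00 ... (BOM first, little-endian on most machines)
```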
What this encoding allows as a bonus is backward compatibility with ASCII-encoded files, as all ASCII characters would be read properly. This is not the case with the UTF-32, UTF-16, and UCS-2 encoding schemes, as they expect an exact number of bytes per code point (4 for UTF-32, 2 for UCS-2, and 2-4 for UTF-16).
The character A falls in the range of ASCII characters. Therefore, it's encoded just like in ASCII:
01000001
The Greek letter Φ has the Unicode code point U+03A6.
00000011 10100110
// Number of ones == number of bytes, then 0
110XXXXX
10XXXXXX
Since our example code point Φ fits into 10 binary digits, and the 2-byte pattern offers 5 + 6 = 11 bits, 2 bytes are enough for this code point.
11001110 10100110
Or, in hexadecimal, we get CE A6.
00000000 00000001 11110110 10000000
1110XXXX 10XXXXXX 10XXXXXX
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
This gives us 3x6+3 = 21 bits of space for Unicode code points.
11110000 10011111 10011010 10000000
Which is F0 9F 9A 80 in hexadecimal. You can verify it here:
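Or verify it with a couple of lines of Python (assuming Python 3.8+ for bytes.hex with a separator):

```python
for ch in ("A", "Φ", "🚀"):
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex(" "))

# A  0x41     41
# Φ  0x3a6    ce a6
# 🚀 0x1f680  f0 9f 9a 80
```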
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
We know that the BOM code point is U+FEFF. Its binary form is:
11111110 11111111
We already know that the first byte must start with as many ones as there are bytes required for this code point in UTF-8 encoding (in this case 3), followed by a zero. All other bytes must start with 10:
// First byte mask
1110xxxx
// Other bytes masks
10xxxxxx
10xxxxxx
So, the complete mask for our U+FEFF code point is:
1110xxxx 10xxxxxx 10xxxxxx
Now all that is left is to populate the x placeholders with the code point bits. We finally get:
11101111 10111011 10111111
EF BB BF
If a file is saved with the UTF-8 with BOM encoding scheme, its first three bytes will be:

EF BB BF

Note that, unlike UTF-16, UTF-8 is a byte-oriented encoding, so there is no byte-order variation here: the BOM is always written as EF BB BF and serves purely as a signature.
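Python exposes this variant as the "utf-8-sig" codec, which simply prepends the EF BB BF signature:

```python
text = "abcde"

print(text.encode("utf-8").hex(" "))      # 61 62 63 64 65
print(text.encode("utf-8-sig").hex(" "))  # ef bb bf 61 62 63 64 65
```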
Here is another demo of a file saved in the UTF-8 BOM encoding:
In most tutorials for beginners in any programming language, you learn how to print out the console famous “Hello world!” sentence.
Programming language | Default character encoding | Unicode aware |
---|---|---|
C/C++ | ASCII | ❌ |
Java | UTF-16 (first versions used UCS-2) | ✅ |
C# | UTF-16 | ✅ |
JavaScript | UTF-16 | ✅ |
PHP | ASCII | ❌ |
Python 2 | ASCII | ❌ |
Python 3 | UTF-8 | ✅ |
The key point to take away from here is that you should be aware of the limitations your programming language has when processing text data, otherwise you can get into all kinds of funny situations. For example, counting the number of characters in Unicode text that has at least one character outside the ASCII range using the C++ char datatype would give you the wrong result.
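A quick illustration in Python: the number of code points, UTF-16 code units, and UTF-8 bytes of the same text are three different numbers, so counting raw chars or bytes as "characters" breaks as soon as non-ASCII text shows up:

```python
text = "Hello World🚀!"

print(len(text))                           # 13 code points
print(len(text.encode("utf-16-le")) // 2)  # 14 UTF-16 code units (🚀 needs a surrogate pair)
print(len(text.encode("utf-8")))           # 16 bytes in UTF-8
```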
For example, there is the case of the MySQL server, which has an encoding scheme called "utf8" that is just an alias for "utf8mb3". What it represents is UTF-8 with a Maximum of 3 Bytes per character.
As shown in the section "UTF-8 encoding summary", with 3 bytes you can only store the range from U+0000 up to U+FFFF, the so-called BMP (Basic Multilingual Plane).
In MySQL, the encoding scheme “utf8” cannot store Unicode code points outside Basic Multilingual Plane!
Content-Type: text/plain; charset="UTF-8"
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
...
The meta tag declaring the charset in HTML should be the first thing in the <head> tag, for 2 reasons:
You may ask what happens if there is no charset defined in the HTML. Does the browser try to guess the encoding, or does it just use some default encoding scheme? Well, browsers do try to guess the encoding based on statistics of how often particular characters appear in the text. As you may assume, this is not very successful at producing good results.