UTF-8
What is UTF-8?
- UTF (Unicode Transformation Format) is a family of encoding standards that turn the raw bytes stored on a computer into text a human can read, and back again.
- First, anything you store on your computer (disk, RAM, anywhere), whether text, video, or images, is stored in binary (1s and 0s) under the hood.
- Each letter is tagged with a number, and that number is what gets stored in binary, e.g.:
"H" - 72 (in decimal), 1001000 (in binary)
"E" - 69 (in decimal), 1000101 (in binary)
etc.
- Like the above, every letter (English, Japanese, Tamil, etc.), symbol, and emoji has been mapped to a unique number. This mapping is called Unicode.
- Unicode was created primarily so that humans can view text instead of plain binary: rather than showing binary values to the user, each letter has a mapped ID, and programs (Notepad, VS Code, browsers, etc.) show the matching letter.
- Converting those Unicode numbers into bytes, and bytes back into numbers, is the job of an encoding. UTF-8 is the most common one.
- There are several encodings, such as UTF-8, UTF-16, and UTF-32, as well as others.
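The points above can be tried out with a small sketch. It assumes Python 3.8 or newer and uses only built-ins (`ord`, `bin`, `str.encode`); the sample text "HE" is just an illustration taken from the letters above.

```python
# Look up the Unicode number (code point) of a few characters.
for ch in ["H", "E", "😀"]:
    code_point = ord(ch)                      # character -> number
    print(ch, code_point, bin(code_point))

# The same text turns into different bytes under different encodings.
text = "HE"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")        # big-endian, no byte-order mark
utf32_bytes = text.encode("utf-32-be")        # big-endian, no byte-order mark

print(utf8_bytes.hex(" "))    # 48 45
print(utf16_bytes.hex(" "))   # 00 48 00 45
print(utf32_bytes.hex(" "))   # 00 00 00 48 00 00 00 45
```

The Unicode numbers are the same in all three cases; only the number of bytes used per character changes.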
NOTE:
Initially there was only ASCII, which represents only English letters, digits, punctuation, and control codes,
128 characters in total. Later, every language and its symbols were included and
placed into one common standard called Unicode.
UTF representation
Initially, when ASCII was the only standard, each character was stored in 1 byte (8 bits), but only
7 of those bits were actually used.

Once Unicode was introduced, some characters and emoji needed more than 8 bits.
The designers of UTF-8 did something simple and wonderful here: remember that ASCII
uses only 7 of the 8 bits in a byte, so the leftover (first) bit is used to mark
whether the byte is ASCII or part of a newer character. If it is ASCII the bit is 0; if it is newer the bit is 1.
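A minimal sketch of that leftover-bit check, assuming Python 3. The helper name `is_ascii_byte` is made up for this example and is not part of any library.

```python
def is_ascii_byte(byte: int) -> bool:
    """Return True when the first (highest) bit of the byte is 0, i.e. plain ASCII."""
    return (byte & 0b10000000) == 0


# "H" is ASCII (1 byte); the emoji takes several bytes, all with the first bit set to 1.
for byte in "H😀".encode("utf-8"):
    kind = "ascii" if is_ascii_byte(byte) else "part of a newer (multi-byte) character"
    print(f"{byte:08b}", kind)
```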
Example:
Let's take this emoji: 😀 -> U+1F600 (Unicode ID, also called a code point)
decimal -> 128512
binary -> 0001 1111 0110 0000 0000
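These three values can be double-checked with a short sketch, assuming Python 3. The variable names are only for illustration.

```python
emoji = "😀"
code_point = ord(emoji)                    # the Unicode ID as a plain integer

hex_id = f"U+{code_point:X}"               # hexadecimal form used in Unicode charts
bits = f"{code_point:020b}"                # 20 binary digits, zero-padded on the left
grouped = " ".join(bits[i:i + 4] for i in range(0, len(bits), 4))

print(hex_id)        # U+1F600
print(code_point)    # 128512
print(grouped)       # 0001 1111 0110 0000 0000
```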
NOTE: As mentioned, UTF-8 sets the 1st bit of each byte to either 0 or 1.
UTF-8 also needs to know how many bytes a character takes. For this,
the first byte encodes the total byte count, e.g.:
if the letter is 1 byte (ASCII) it looks like
0xxxxxxx
if the letter is 2 bytes it looks like
110xxxxx 10xxxxxx
if the letter is 3 bytes it looks like
1110xxxx 10xxxxxx 10xxxxxx
if the letter is 4 bytes it looks like
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Every byte after the first starts with 10 (a continuation byte). A 10xxxxxx byte never stands alone as a character.
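The byte patterns above can be read back in code. This is a rough sketch, assuming Python 3; `sequence_length` is a hypothetical helper written for this note, and it only looks at the first byte without validating the rest of the sequence.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character takes, judged from its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx -> 1 byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx -> 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx -> 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx -> 4 bytes
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError("not a valid first byte")


for ch in ["H", "ß", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, f"{encoded[0]:08b}", sequence_length(encoded[0]), len(encoded))
```

For each character, the length guessed from the first byte should match the real length of the encoded bytes.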
For our example above, 128512 fits in 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx), which have 21 x slots:
binary -> 0001 1111 0110 0000 0000
The x slots are filled from right to left:
the 1st 6 bits from the right (000000) go into the 4th byte
the 2nd 6 bits from the right (011000) go into the 3rd byte
the 3rd 6 bits from the right (011111) go into the 2nd byte
the remaining bits (00, padded with a leading 0 to 000) go into the 1st byte
11110000 10011111 10011000 10000000
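The filling steps can be written as a small hand-rolled encoder. This is only a sketch under a few assumptions: Python 3, a made-up function name `encode_utf8`, and no check for the surrogate range (U+D800 to U+DFFF), which real encoders reject. In practice the built-in `str.encode("utf-8")` does this work.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand (illustrative only)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:                       # fits in 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:                    # fits in 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:                  # fits in 21 bits: 11110xxx + 3 continuation bytes
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")


manual = encode_utf8(0x1F600)
print(" ".join(f"{b:08b}" for b in manual))     # 11110000 10011111 10011000 10000000
print(manual == "😀".encode("utf-8"))            # True
```

Each `& 0b111111` keeps the 6 lowest bits, and each `>>` shifts the next group of 6 into position, which matches filling the x slots from right to left.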