UNICODE

The dictionaries of Agheyisi, Melzian and Thomas as well as the corpus of Thomas contain many special characters. To display these characters the Unicode standard with the UTF-8 encoding is used.

Hexadecimal notation

A hexadecimal digit is one of the characters '0',...,'9','A',...,'F'. Each hexadecimal digit denotes a number:


'0'=0 '1'=1 '2'=2 '3'=3 '4'=4 '5'=5 '6'=6 '7'=7 '8'=8 '9'=9


'A'=10 'B'=11 'C'=12 'D'=13 'E'=14 'F'=15

A hexadecimal number is a number written in hexadecimal notation, that is, with base 16. For example:


203F = 2*16*16*16 + 0*16*16 + 3*16 + 15 = 8255
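The conversion can be checked in the bash shell itself. The lines below are a small sketch that relies only on bash arithmetic and the printf builtin.

```bash
# Hexadecimal to decimal: the 16# prefix tells bash to read the digits in base 16.
echo $((16#203F))                            # prints 8255
# The same value computed digit by digit (F = 15).
echo $((2*16*16*16 + 0*16*16 + 3*16 + 15))   # prints 8255
# Decimal back to hexadecimal.
printf '%X\n' 8255                           # prints 203F
```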

UNICODE

UNICODE is a standard that assigns a code number to a character (see Wikipedia). Most commonly used Unicode characters have a code number between 0 and 65,535 and thus need four hexadecimal digits in hexadecimal notation.

Code numbers between 0 and 65,535 are usually written as U+nnnn where n is a hexadecimal digit. For example, the character '‿' (UNDERTIE) has code number U+203F (8255 decimal).

An encoding is a mapping from code numbers (which represent characters) to sequences of code units. A code unit is in practice an octet (8-bit byte), a double octet (16-bit quantity), or a quadruple octet (32-bit quantity).

UTF-8

UTF-8 encoding

UTF-8 uses 8-bit code units, and it represents characters in the Basic Latin (ASCII) range U+0000 to U+007F efficiently, one code unit per character. On the other hand, this implies that all other characters use at least two code units, which all have the most significant bit set, i.e., they are in the range 80 to FF (hexadecimal). More exactly, they are in the range 80 to F4. This means that when there is a code unit in the range 00 to 7F in UTF-8 data, we can know that it represents a Basic Latin character and cannot be part of the representation of some other character.

These structural decisions imply that UTF-8 is relatively inefficient, since it leaves many simple combinations unused. There is yet another principle that has a similar effect. In a representation of any character other than Basic Latin characters, the first (leading) code unit is from a specific range, and all the subsequent (trailing) code units are from a different range.

UTF-8 Encoding Algorithm

For a character outside the Basic Latin block, UTF-8 uses two, three, or four octets. You might encounter specifications that describe UTF-8 as using up to six octets per character, but they reflect definitions that did not restrict the Unicode coding space the way it has now been restricted.

The UTF-8 algorithm is described in Table 6-1. The first column specifies a bit pattern, in 16 or 21 bits, grouped for readability. The other columns indicate how the pattern is mapped to code units (octets), represented here as bit patterns.

Table 6-1. UTF-8 encoding algorithm

Code number in binary	Octet 1	Octet 2	Octet 3	Octet 4
00000000 0xxxxxxx	0xxxxxxx
00000yyy yyxxxxxx	110yyyyy	10xxxxxx
zzzzyyyy yyxxxxxx	1110zzzz	10yyyyyy	10xxxxxx
uuuww zzzzyyyy yyxxxxxx	11110uuu	10wwzzzz	10yyyyyy	10xxxxxx
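The bit patterns of Table 6-1 translate directly into shifts and masks. The following bash sketch illustrates this; the function name utf8_encode is a hypothetical helper introduced here, and it assumes that its argument is a code number in hexadecimal without the 'U+'.

```bash
# utf8_encode (hypothetical helper): print the UTF-8 octets, in hexadecimal,
# of the code number given as a hexadecimal argument, e.g. 203F.
utf8_encode() {
  local cp=$((16#$1))
  # Reject values outside the Unicode coding space and the surrogate range.
  if ((cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))); then
    echo "not a valid code number: $1" >&2
    return 1
  fi
  if ((cp < 0x80)); then            # row 1: 0xxxxxxx
    printf '%02x\n' "$cp"
  elif ((cp < 0x800)); then         # row 2: 110yyyyy 10xxxxxx
    printf '%02x %02x\n' \
      $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
  elif ((cp < 0x10000)); then       # row 3: 1110zzzz 10yyyyyy 10xxxxxx
    printf '%02x %02x %02x\n' \
      $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
      $((0x80 | (cp & 0x3F)))
  else                              # row 4: 11110uuu 10wwzzzz 10yyyyyy 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
      $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  fi
}

utf8_encode 2015    # prints e2 80 95
utf8_encode 02E5    # prints cb a5
```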

Thus, the UTF-8 encoding uses bit combinations of very specific types in the octets. If you pick up an octet from UTF-8 encoded data, you can immediately see its role. If the first bit is 0, the octet is a single-octet representation of a (Basic Latin) character. Otherwise, you look at the second bit as well. If it is 0, you know that you have a second, third, or fourth octet of a multioctet representation of a character. Otherwise, you have the first octet of such a representation, and the initial bits 110, 1110, or 11110 reveal whether the representation is two, three, or four octets long.

Thus, interpreting (decoding) UTF-8 is straightforward, too. You take an octet, match it with the patterns in column “Octet 1” in Table 6-1, and read zero to three additional octets accordingly. Then you construct the binary representation of the code number from the bit sequences you extract from the octets. Naturally, nobody wants to do this by hand, but the point is that this can be implemented efficiently, as operations on bit fields. A correct implementation of Unicode has to signal an error if there is data that does not match any of the defined patterns.
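The decoding steps can be sketched in bash as follows. The function name utf8_decode is a hypothetical helper, and it assumes that the octets of exactly one character are passed as separate hexadecimal arguments. It is a simplification: it checks the leading and trailing bit patterns but not every context-dependent restriction on the second octet.

```bash
# utf8_decode (hypothetical helper): print the code number encoded by the
# given octets, e.g. "utf8_decode e2 80 95".
utf8_decode() {
  local b=$((16#$1)) cp n t
  # Match the first octet against the "Octet 1" patterns.
  if   ((b < 0x80));               then cp=$b;            n=0
  elif ((b >= 0xC2 && b <= 0xDF)); then cp=$((b & 0x1F)); n=1
  elif ((b >= 0xE0 && b <= 0xEF)); then cp=$((b & 0x0F)); n=2
  elif ((b >= 0xF0 && b <= 0xF4)); then cp=$((b & 0x07)); n=3
  else
    echo "invalid leading octet: $1" >&2
    return 1
  fi
  if (($# != n + 1)); then
    echo "expected $n trailing octet(s) after $1" >&2
    return 1
  fi
  shift
  # Every trailing octet must look like 10xxxxxx and contributes six bits.
  for t in "$@"; do
    t=$((16#$t))
    if (((t & 0xC0) != 0x80)); then
      echo "invalid trailing octet" >&2
      return 1
    fi
    cp=$(((cp << 6) | (t & 0x3F)))
  done
  printf 'U+%04X\n' "$cp"
}

utf8_decode e2 80 95    # prints U+2015
utf8_decode cb a5       # prints U+02E5
```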

A quick way to find out the UTF-8 encoding of a string is to visit http://www.google.com on any modern browser, type the string into the keyword box, and hit Search. Then just look at the address field of the browser. For example, if you type pâté, the address field will contain http://www.google.com/search?hl=en&lr=&q=p%C3%A2t%C3%A9, so you can see that â is encoded as the octets C3 A2 and é as octets C3 A9. (In some situations, this does not work since Google does not use UTF-8. In that case, use the URL http://www.google.com/webhp?ie=UTF-8 to force the input encoding to UTF-8.)
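The octets of a string can also be inspected locally with the standard od command. The sketch below assumes that the terminal (or the script file) uses UTF-8; the function name utf8_bytes is a hypothetical helper.

```bash
# utf8_bytes (hypothetical helper): print the octets of the argument in hexadecimal.
utf8_bytes() {
  local hex
  # od -An -tx1 prints every byte as two hexadecimal digits without offsets.
  hex=$(printf '%s' "$1" | od -An -tx1 -v)
  # Unquoted on purpose: word splitting collapses the padding and line breaks of od.
  echo $hex
}

utf8_bytes 'pâté'
```

In a UTF-8 terminal this should print 70 c3 a2 74 c3 a9, which agrees with the octets C3 A2 for â and C3 A9 for é.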

Some Properties of UTF-8

Due to the algorithm, the octets appearing in UTF-8 are limited to certain ranges, as shown in Table 6-2. In particular, octets C0 and C1 and F5 through FF do not appear in UTF-8. Other octets may appear in specific contexts only. This means that if you have a large file that is not, in fact, character data in UTF-8 and you try to read it as UTF-8, it is most probable that errors will be signaled.

Table 6-2. Octet ranges in UTF-8

Code range	Octet 1	Octet 2	Octet 3	Octet 4
U+0000..U+007F	00..7F
U+0080..U+07FF	C2..DF	80..BF
U+0800..U+0FFF	E0	A0..BF	80..BF
U+1000..U+CFFF	E1..EC	80..BF	80..BF
U+D000..U+D7FF	ED	80..9F	80..BF
U+E000..U+FFFF	EE..EF	80..BF	80..BF
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF
U+100000..U+10FFFF	F4	80..8F	80..BF	80..BF
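One way to see such errors being signaled is to let iconv convert data from UTF-8 to UTF-8: the conversion only succeeds if the input is well formed. This is a sketch; the function name is_valid_utf8 is a hypothetical helper that reads standard input.

```bash
# is_valid_utf8 (hypothetical helper): succeed only if standard input is well-formed UTF-8.
is_valid_utf8() {
  # iconv exits with a non-zero status when it meets an invalid octet sequence.
  iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# The octets of "pâté" are accepted.
printf 'p\xc3\xa2t\xc3\xa9' | is_valid_utf8 && echo "valid UTF-8"
# The octet FF never appears in UTF-8, so this input is rejected.
printf '\xff\xfe' | is_valid_utf8 || echo "not valid UTF-8"
```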

Similarly to UTF-16, UTF-8 makes it impossible to access the nth character of a string directly. UTF-8 is robust, though: if a code unit is corrupted, other characters will be processed correctly. The reason is that UTF-8 has been designed so that a code unit starting the representation of a character can be recognized as such, even if the preceding code unit is in error.
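Because trailing octets (80 to BF) can always be told apart from the octet that starts a character, the number of characters in UTF-8 data can be found by ignoring the trailing octets. The sketch below does this with tr and wc; the function name count_chars is a hypothetical helper. Note that it counts code numbers, so a combining mark is counted separately from its base character.

```bash
# count_chars (hypothetical helper): count the characters in a UTF-8 string.
count_chars() {
  # Delete all trailing octets (80..BF, octal 200..277); what remains is
  # exactly one octet per character, which wc -c then counts.
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c
}

count_chars $'p\xc3\xa2t\xc3\xa9'    # the 6 octets of "pâté" hold 4 characters
```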

UTF-8 encoding tables

    U+2015	―	e2 80 95	HORIZONTAL BAR

The first column contains the Unicode code number.

The second column contains the character.

The third column contains the UTF-8 encoding as hexadecimal octets.

The fourth column contains the name of the character.

Combining characters

The Unicode characters between U+0300 and U+036F are diacritical marks that combine with the preceding Unicode character. A normal character can be followed by several diacritical marks.

ọ́  U+006f U+0301 U+0323

with

U+006F  o   6f     LATIN SMALL LETTER O

U+0301   ́   cc 81  COMBINING ACUTE ACCENT

U+0323   ̣   cc a3  COMBINING DOT BELOW

Not all display software will display such characters correctly. The diacritical marks may appear after the leading character instead of above or below it.
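When the display is in doubt, the code numbers that make up a string can be listed instead. The following sketch converts the string to UTF-32 (four octets per character, most significant octet first) with iconv and prints each group of four octets as a code number. The function name codepoints is a hypothetical helper.

```bash
# codepoints (hypothetical helper): list the code numbers in a UTF-8 string.
codepoints() {
  local bytes i out=()
  # UTF-32BE stores every character as four octets, most significant first.
  bytes=($(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 -v))
  for ((i = 0; i < ${#bytes[@]}; i += 4)); do
    out+=("$(printf 'U+%04X' \
      "$((16#${bytes[i]}${bytes[i+1]}${bytes[i+2]}${bytes[i+3]}))")")
  done
  echo "${out[*]}"
}

# The letter o followed by the octets cc 81 and cc a3:
codepoints $'o\xcc\x81\xcc\xa3'    # prints U+006F U+0301 U+0323
```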

Linux commands

echo

localhost: $ echo hello

hello

localhost: $

The -n option to echo causes it not to print a LINEFEED after printing its arguments. For example:

localhost: $ echo -n hello

hellolocalhost: $

The -e option to echo turns on the interpretation of backslash-escaped characters. For example '\n' represents a NEWLINE and '\t' a tab. '\x' followed by two hexadecimal digits is the hexadecimal representation of one byte (a number between 0 and 255).

localhost: $ echo "\x3F\x21"

\x3F\x21

localhost: $ echo -e "\x3F\x21"

?!

localhost: $

Here, '\x3F' is the hexadecimal representation of the character '?' (1 byte) and '\x21' is the hexadecimal representation of the character '!'.
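In bash, the printf builtin interprets the same '\x' escapes in its format string without needing an option. A small sketch:

```bash
# printf interprets \xHH in the format string; \n adds the LINEFEED explicitly.
printf '\x3F\x21\n'    # prints ?!
```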

iconv

The iconv command converts the given files from one character encoding to another. Instead of specifying a filename one can pipe the output of a command to the iconv command. For example:

localhost: $ echo -ne "\xE5\x02" | iconv -fUNICODE -tUTF-8

˥ localhost: $

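where '˥' is the Unicode character U+02E5. Note the inverted order of the bytes: the low-order byte E5 comes first, followed by the high-order byte 02.

Here UNICODE probably means UTF-16 coding (16-bit code units, so this character is one unsigned 16-bit integer), read in little-endian byte order, as the order of the bytes shows. The bash shell apparently uses UTF-8.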
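The byte order can also be stated explicitly, which avoids guessing. The sketch below assumes that the installed iconv knows the encoding names UTF-16LE (low-order byte first) and UTF-16BE (high-order byte first), as GNU iconv does.

```bash
# Little-endian: low-order byte E5 first, then high-order byte 02.
printf '\xE5\x02' | iconv -f UTF-16LE -t UTF-8; echo
# Big-endian: the bytes appear in the same order as in the code number U+02E5.
printf '\x02\xE5' | iconv -f UTF-16BE -t UTF-8; echo
```

Both commands should print the character '˥'.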

Some utility functions

With the echo and iconv commands we can build simple functions to display the characters corresponding to a given Unicode code number.

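# Each function prints the character(s) for the given four-digit code numbers.
# The two bytes of every code number are swapped because iconv reads
# the low-order byte first here.
function unicode ()
  { echo -ne "\x${1:2:2}\x${1:0:2}" | iconv -f UNICODE -t UTF-8;
    echo "";
  }

function unicode2 ()
  { echo -ne "\x${1:2:2}\x${1:0:2}\x${2:2:2}\x${2:0:2}" | iconv -f UNICODE -t UTF-8;
    echo "";
  }

function unicode3 ()
  { echo -ne "\x${1:2:2}\x${1:0:2}\x${2:2:2}\x${2:0:2}\x${3:2:2}\x${3:0:2}" | iconv -f UNICODE -t UTF-8;
    echo "";
  }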

These functions take one, two and three code numbers, respectively, as input (without the 'U+'); each code number must be written with exactly four hexadecimal digits. The expression '${3:2:2}' denotes the substring of the third argument that starts at the third character (offset 2, since the first character has offset 0) and has length 2. This substring is prefixed with '\x' and concatenated with the other, similar expressions. The echo command turns the result into bytes, which are piped to the iconv command.

localhost: $ unicode 2019

’

localhost: $ unicode2 006F 0349

o͉

localhost: $ unicode3 006F 0349 0301

ó͉

localhost: $

localhost: $ unicode 02E5

˥

localhost: $ echo ˥ | hexdump -C

00000000 cb a5 0a |...|

00000003

localhost: $

The UTF-8 code of '˥' (code number U+02E5) is 'cb a5'. '0a' is the LINEFEED added by the echo command.
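The three functions differ only in the number of arguments, so they could be folded into one function that loops over its arguments. The following is a sketch under two assumptions: the function name ucodes is hypothetical, and the encoding name UTF-16BE is used so that the bytes need not be swapped. As before, every code number must have exactly four hexadecimal digits.

```bash
# ucodes (hypothetical helper): print the characters for any number of
# four-digit code numbers, e.g. "ucodes 006F 0349 0301".
ucodes() {
  local code
  for code in "$@"; do
    # Emit the high-order byte first, then the low-order byte.
    printf "\\x${code:0:2}\\x${code:2:2}"
  done | iconv -f UTF-16BE -t UTF-8
  echo
}

ucodes 02E5
ucodes 006F 0349 0301
```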
