# UNICODE

## Introduction

The dictionaries of Agheyisi, Melzian and Thomas, as well as the corpus of Thomas, contain many special characters. To display these characters, the Unicode standard with the UTF-8 encoding is used.

A hexadecimal digit is one of the characters '0',...,'9','A',...,'F'. Each hexadecimal digit denotes a number:

``` '0'=0 '1'=1 '2'=2 '3'=3 '4'=4 '5'=5 '6'=6 '7'=7 '8'=8 '9'=9 ```

``` 'A'=10 'B'=11 'C'=12 'D'=13 'E'=14 'F'=15 ```

A hexadecimal number is a number written in hexadecimal notation, that is, with base 16. For example:

``` 203F = 2*16*16*16 + 0*16*16 + 3*16 + 15 = 8255 ```
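Such a conversion can be checked by letting the shell do the arithmetic. The following is a minimal sketch, assuming a POSIX-style shell (such as bash) whose arithmetic expansion accepts the `0x` prefix. The name `hex_to_dec` is a hypothetical helper introduced only for this illustration.

```shell
# hex_to_dec: print the decimal value of a hexadecimal number.
# (Hypothetical helper, not part of any standard tool.)
hex_to_dec() {
    echo "$((0x$1))"
}

hex_to_dec F       # a single digit: 15
hex_to_dec 203F    # 2*4096 + 0*256 + 3*16 + 15 = 8255
```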

## UNICODE

Unicode is a standard that assigns a code number to each character (see Wikipedia). Most commonly used characters have a code number between 0 and 65,535 and thus need at most four hexadecimal digits in hexadecimal notation.

Code numbers between 0 and 65,535 are usually written as U+nnnn, where each n is a hexadecimal digit. For example, the character '‿' (UNDERTIE) has code number U+203F (8255 decimal).
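Going the other way, the U+nnnn notation is simply the code number printed in hexadecimal and padded to at least four digits. A small sketch using the shell's `printf`; the name `to_u_notation` is an assumption made for this example.

```shell
# to_u_notation: print a decimal code number in U+nnnn form.
# (Hypothetical helper name.)
to_u_notation() {
    printf 'U+%04X\n' "$1"
}

to_u_notation 8255   # prints U+203F
to_u_notation 111    # prints U+006F
```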

An encoding is a mapping from code numbers (which represent characters) to sequences of code units. A code unit is in practice an octet (8-bit byte), a double octet (16-bit quantity), or a quadruple octet (32-bit quantity).

The encoding scheme used here is UTF-8.

## UTF-8

### UTF-8 encoding

The following is quoted from Korpela, Jukka K., Unicode Explained, O'Reilly (2006):

UTF-8 uses 8-bit code units, and it represents characters in the Basic Latin (ASCII) range U+0000 to U+007F efficiently, one code unit per character. On the other hand, this implies that all other characters use at least two code units, which all have the most significant bit set, i.e., they are in the range 80 to FF (hexadecimal). More exactly, they are in the range 80 to F4. This means that when there is a code unit in the range 00 to 7F in UTF-8 data, we can know that it represents a Basic Latin character and cannot be part of the representation of some other character.

These structural decisions imply that UTF-8 is relatively inefficient, since it leaves many simple combinations unused. There is yet another principle that has a similar effect. In a representation of any character other than Basic Latin characters, the first (leading) code unit is from a specific range, and all the subsequent (trailing) code units are from a different range.

#### UTF-8 Encoding Algorithm

For a character outside the Basic Latin block, UTF-8 uses two, three, or four octets. You might encounter specifications that describe UTF-8 as using up to six octets per character, but they reflect definitions that did not restrict the Unicode coding space the way it has now been restricted.

The UTF-8 algorithm is described in Table 6-1. The first column specifies a bit pattern, in 16 or 21 bits, grouped for readability. The other columns indicate how the pattern is mapped to code units (octets), represented here as bit patterns.

Table 6-1. UTF-8 encoding algorithm

| Code number in binary | Octet 1 | Octet 2 | Octet 3 | Octet 4 |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| uuuww zzzzyyyy yyxxxxxx | 11110uuu | 10wwzzzz | 10yyyyyy | 10xxxxxx |
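The bit patterns of Table 6-1 translate directly into shifts and masks. The following is a sketch of that mapping in shell arithmetic, not a complete implementation: `utf8_encode` is a hypothetical helper, it takes one code number in hexadecimal, and it does not reject the surrogate range U+D800..U+DFFF.

```shell
# utf8_encode: print the UTF-8 octets (in hex) for one hex code number.
# (Hypothetical helper; surrogates are not rejected.)
utf8_encode() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then            # 0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then         # 110yyyyy 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then        # 1110zzzz 10yyyyyy 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -le 1114111 ]; then      # 11110uuu 10wwzzzz 10yyyyyy 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        echo "code number out of range: $1" >&2
        return 1
    fi
}

utf8_encode 006F    # 6f
utf8_encode 02E5    # cb a5
utf8_encode 2015    # e2 80 95
```

Each branch keeps the fixed marker bits (`110`, `1110`, `11110`, `10`) and fills the remaining positions with six-bit slices of the code number, exactly as the table rows do.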

Thus, the UTF-8 encoding uses bit combinations of very specific types in the octets. If you pick up an octet from UTF-8 encoded data, you can immediately see its role. If the first bit is 0, the octet is a single-octet representation of a (Basic Latin) character. Otherwise, you look at the second bit as well. If it is 0, you know that you have a second, third, or fourth octet of a multioctet representation of a character. Otherwise, you have the first octet of such a representation, and the initial bits 110, 1110, or 11110 reveal whether the representation is two, three, or four octets long.
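The role of an octet can be read off its leading bits with a few right shifts. The sketch below assumes the octet is given as two hexadecimal digits; `octet_role` and its labels are hypothetical names chosen for this example. It looks only at the bit pattern, so octets such as C0 or F5, which match a leading pattern but never occur in valid UTF-8, are still reported as leading octets.

```shell
# octet_role: classify one octet (two hex digits) by its leading bits.
# (Hypothetical helper; checks the bit pattern only.)
octet_role() {
    b=$((0x$1))
    if   [ $((b >> 7)) -eq 0 ];  then echo "single"     # 0xxxxxxx
    elif [ $((b >> 6)) -eq 2 ];  then echo "trailing"   # 10xxxxxx
    elif [ $((b >> 5)) -eq 6 ];  then echo "lead-2"     # 110xxxxx
    elif [ $((b >> 4)) -eq 14 ]; then echo "lead-3"     # 1110xxxx
    elif [ $((b >> 3)) -eq 30 ]; then echo "lead-4"     # 11110xxx
    else echo "invalid"
    fi
}

for o in 6f cb a5 e2 f0 ff; do
    echo "$o: $(octet_role "$o")"
done
```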

Thus, interpreting (decoding) UTF-8 is straightforward, too. You take an octet, match it with the patterns in column “Octet 1” in Table 6-1, and read zero to three additional octets accordingly. Then you construct the binary representation of the code number from the bit sequences you extract from the octets. Naturally, nobody wants to do this by hand, but the point is that this can be implemented efficiently, as operations on bit fields. A correct implementation of Unicode has to signal an error if there is data that does not match any of the defined patterns.
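The decoding steps can be sketched with the same bit operations. The hypothetical helper `utf8_decode` below takes the octets of one character as hexadecimal arguments, signals an error when the octets do not fit the patterns, and prints the code number. It is deliberately simplified: overlong forms (such as C0 80) and surrogates are not rejected, which a conforming decoder would have to do.

```shell
# utf8_decode: print the code number for one character given as hex octets.
# (Hypothetical helper; overlong forms and surrogates are not rejected.)
utf8_decode() {
    b1=$((0x$1))
    # The leading octet tells how many trailing octets follow (n)
    # and contributes the first bits of the code number (cp).
    if   [ $((b1 >> 7)) -eq 0 ];  then n=0; cp=$b1
    elif [ $((b1 >> 5)) -eq 6 ];  then n=1; cp=$((b1 & 0x1F))
    elif [ $((b1 >> 4)) -eq 14 ]; then n=2; cp=$((b1 & 0x0F))
    elif [ $((b1 >> 3)) -eq 30 ]; then n=3; cp=$((b1 & 0x07))
    else echo "invalid leading octet: $1" >&2; return 1
    fi
    shift
    if [ "$#" -ne "$n" ]; then
        echo "expected $n trailing octet(s), got $#" >&2
        return 1
    fi
    for t in "$@"; do
        b=$((0x$t))
        if [ $((b >> 6)) -ne 2 ]; then        # must be 10xxxxxx
            echo "invalid trailing octet: $t" >&2
            return 1
        fi
        cp=$(( (cp << 6) | (b & 0x3F) ))      # append six more bits
    done
    printf 'U+%04X\n' "$cp"
}

utf8_decode 6f          # U+006F
utf8_decode cb a5       # U+02E5
utf8_decode e2 80 95    # U+2015
```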

A quick way to find out the UTF-8 encoding of a string is to visit http://www.google.com on any modern browser, type the string into the keyword box, and hit Search. Then just look at the address field of the browser. For example, if you type pâté, the address field will contain http://www.google.com/search?hl=en&lr=&q=p%C3%A2t%C3%A9, so you can see that â is encoded as the octets C3 A2 and é as octets C3 A9. (In some situations, this does not work since Google does not use UTF-8. In that case, use the URL http://www.google.com/webhp?ie=UTF-8 to force the input encoding to UTF-8.)

#### Some Properties of UTF-8

Due to the algorithm, the octets appearing in UTF-8 are limited to certain ranges, as shown in Table 6-2. In particular, octets C0 and C1 and F5 through FF do not appear in UTF-8. Other octets may appear in specific contexts only. This means that if you have a large file that is not, in fact, character data in UTF-8 and you try to read it as UTF-8, it is most probable that errors will be signaled.

Table 6-2. Octet ranges in UTF-8

| Code range | Octet 1 | Octet 2 | Octet 3 | Octet 4 |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
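The octets that never occur (C0, C1 and F5 through FF) can be tested for with plain integer comparisons. A minimal sketch; `never_in_utf8` is a hypothetical helper name, and the test covers only these globally excluded octets, not the context-dependent ranges of Table 6-2.

```shell
# never_in_utf8: succeed if the octet (two hex digits) cannot occur
# anywhere in UTF-8 data. (Hypothetical helper.)
never_in_utf8() {
    b=$((0x$1))
    # C0 = 192, C1 = 193, F5 = 245
    [ "$b" -eq 192 ] || [ "$b" -eq 193 ] || [ "$b" -ge 245 ]
}

for o in 7f c0 c2 f4 f5 ff; do
    if never_in_utf8 "$o"; then
        echo "$o: never appears"
    else
        echo "$o: may appear"
    fi
done
```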

Similarly to UTF-16, UTF-8 makes it impossible to access the nth character of a string directly. UTF-8 is robust, though: if a code unit is corrupted, other characters will be processed correctly. The reason is that UTF-8 has been designed so that a code unit starting the representation of a character can be recognized as such, even if the preceding code unit is in error.
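Both properties can be illustrated on a list of octets given in hexadecimal. In the sketch below, `count_chars` has to scan every octet to count characters (there is no direct indexing), and `resync` skips trailing octets until it reaches one that can start a character. Both names are hypothetical helpers, and neither validates the data.

```shell
# count_chars: count characters by counting octets that are NOT
# trailing octets (10xxxxxx). (Hypothetical helper.)
count_chars() {
    n=0
    for o in "$@"; do
        if [ $(( 0x$o >> 6 )) -ne 2 ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# resync: drop octets until one that can start a character,
# then print the remaining octets. (Hypothetical helper.)
resync() {
    while [ "$#" -gt 0 ] && [ $(( 0x$1 >> 6 )) -eq 2 ]; do
        shift
    done
    echo "$*"
}

count_chars 6f cc 81 cc a3    # 3 (o plus two combining marks)
resync a5 6f cc 81            # 6f cc 81 (the stray a5 is skipped)
```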

### UTF-8 encoding tables

A table of Unicode characters with their UTF-8 codes is given at the website "UTF-8 encoding table and Unicode characters".

The table is divided into blocks. Entries in a block are displayed as follows:

```
U+2015	―	e2 80 95	HORIZONTAL BAR
```

The first column contains the Unicode code number.

The second column contains the character.

The third column contains the UTF-8 code in hexadecimal.

The fourth column contains the name of the character.

### Combining characters

The Unicode characters between U+0300 and U+036F are diacritical marks that combine with the preceding Unicode character. A normal character can be followed by several diacritical marks.

For example:

ọ́  `U+006f U+0301 U+0323`

with

`U+006F  `o `  6f     LATIN SMALL LETTER O`

`U+0301  ` ́ `  cc 81  COMBINING ACUTE ACCENT`

`U+0323  ` ̣ `  cc a3  COMBINING DOT BELOW`

Not all display software renders such sequences correctly: the diacritical marks may appear after the base character instead of on it.
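Whether a code number falls in this combining block is a simple range check. The sketch below labels each code number of a sequence; `is_combining` and `describe` are hypothetical helpers, and "other" only means "outside U+0300..U+036F" (Unicode has further combining characters elsewhere).

```shell
# is_combining: succeed if the hex code number lies in U+0300..U+036F.
# (Hypothetical helper.)
is_combining() {
    cp=$((0x$1))
    [ "$cp" -ge "$((0x0300))" ] && [ "$cp" -le "$((0x036F))" ]
}

# describe: label each code number of a sequence. (Hypothetical helper.)
describe() {
    for h in "$@"; do
        if is_combining "$h"; then
            echo "U+$h mark"
        else
            echo "U+$h other"
        fi
    done
}

describe 006F 0301 0323
```

For the sequence shown above this prints one "other" line for U+006F followed by two "mark" lines.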

### Linux commands

#### echo

The echo command prints its arguments. For example:

`localhost: $ echo hello`

`hello`

`localhost: $ `

The -n option to echo causes it not to print a LINEFEED after printing its arguments. For example:

`localhost: $ echo -n hello`

`hellolocalhost: $ `

The -e option to echo turns on the interpretation of backslash-escaped characters. For example '\n' represents a NEWLINE and '\t' a tab. '\x' followed by two hexadecimal digits is the hexadecimal representation of one byte (a number between 0 and 255).

For example:

`localhost: $ echo "\x3F\x21"`

`\x3F\x21`

`localhost: $ echo -e "\x3F\x21"`

`?!`

`localhost: $ `

Here, '\x3F' is the hexadecimal representation of the character '?' (1 byte) and '\x21' is the hexadecimal representation of the character '!'.
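The `-e` option and the `\x` escape are shell-specific extensions, so scripts that must run under other shells may prefer `printf`. The sketch below is one possible workaround, not the method used on this page: the hypothetical helper `hex_byte` converts two hexadecimal digits into the octal escape that `printf` is specified to understand.

```shell
# hex_byte: write the single byte whose value is given as two hex digits.
# The inner printf turns the value into three octal digits (3F -> 077),
# the outer printf interprets the resulting \077 escape.
# (Hypothetical helper.)
hex_byte() {
    printf "\\$(printf '%03o' "$((0x$1))")"
}

hex_byte 3F; hex_byte 21; echo    # prints ?!
```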

#### iconv

The iconv command converts the encoding of given files from one encoding to another. Instead of specifying a filename, one can pipe the output of a command into iconv. For example:

`localhost: $ echo -ne "\xE5\x02" | iconv -fUNICODE -tUTF-8`

˥ `localhost: $ `

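where '˥' is the Unicode character U+02E5 (note the inverted order of the bytes: the low byte E5 comes first).

Here UNICODE probably means a UTF-16 encoding (16-bit code units, that is, one unsigned 16-bit integer per code unit) read in little-endian byte order, which would explain the inversion. The terminal in which bash runs apparently uses UTF-8.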
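The byte order can be made explicit by naming it in the encoding. The sketch below assumes an iconv that knows the names `UTF-16LE` and `UTF-16BE` (GNU iconv does); the bytes are written with octal `printf` escapes (E5 = 345, 02 = 002) so that the example does not depend on `echo -e`.

```shell
# U+02E5 as two bytes, once in each byte order.
printf '\345\002' | iconv -f UTF-16LE -t UTF-8; echo   # low byte first
printf '\002\345' | iconv -f UTF-16BE -t UTF-8; echo   # high byte first
```

Both lines should print '˥'. With `UTF-16BE` the bytes appear in the same order as the digits of the code number.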

#### Some utility functions

With the echo and iconv commands we can build simple functions to display the characters corresponding to a given Unicode code number.

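```
# Print the character for one 4-digit hex code number (without 'U+').
# The two bytes are swapped because iconv reads them low byte first here.
function unicode ()
{ echo -ne "\x${1:2:2}\x${1:0:2}" | iconv -f UNICODE -t UTF-8;
echo "";
}

# Same for a sequence of two code numbers (e.g. a letter and one diacritic).
function unicode2 ()
{ echo -ne "\x${1:2:2}\x${1:0:2}\x${2:2:2}\x${2:0:2}" | iconv -f UNICODE -t UTF-8;
echo "";
}

# Same for a sequence of three code numbers.
function unicode3 ()
{ echo -ne "\x${1:2:2}\x${1:0:2}\x${2:2:2}\x${2:0:2}\x${3:2:2}\x${3:0:2}" | iconv -f UNICODE -t UTF-8;
echo "";
}
```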

These functions take one, two and three code numbers respectively as input (without the 'U+'). The expression '${3:2:2}' is the substring of the third argument that starts at the third character (the first character has index 0) and has length 2. This substring is prefixed with '\x' and concatenated with the other similar expressions. The echo command then pipes the result into the iconv command.
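The same idea can be generalised to any number of code numbers with a loop. The following is a sketch under stated assumptions, not the functions used on this page: `unicodes` is a hypothetical name, the iconv in use is assumed to know `UTF-16BE`, and the bytes are emitted high byte first so that no swapping is needed. Like the functions above, it only handles four-digit code numbers.

```shell
# unicodes: print the characters for any number of 4-digit hex code numbers.
# (Hypothetical helper; code numbers above FFFF are not supported.)
unicodes() {
    for h in "$@"; do
        hi=$(( 0x$h >> 8 ))      # value of the first two hex digits
        lo=$(( 0x$h & 0xFF ))    # value of the last two hex digits
        # Emit high byte, then low byte, as octal printf escapes.
        printf "\\$(printf '%03o' "$hi")\\$(printf '%03o' "$lo")"
    done | iconv -f UTF-16BE -t UTF-8
    echo
}

unicodes 2019
unicodes 006F 0349 0301
```

Arithmetic on the whole code number replaces the substring expressions, which also makes the sketch usable in shells without `${var:offset:length}`.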

For example:

`localhost: $ unicode 2019`

’

`localhost: $ unicode2 006F 0349`

o͉

`localhost: $ unicode3 006F 0349 0301`

ó͉

`localhost: $ `

The UTF-8 code of a character can be displayed with the 'hexdump -C' command.

`localhost: $ unicode 02E5`

˥

`localhost: $ echo `˥` | hexdump -C`

`00000000 cb a5 0a |...|`

`00000003`

`localhost: $ `

The UTF-8 code of '˥' (code number U+02E5) is 'cb a5'. '0a' is the LINEFEED added by the echo command.
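Where `hexdump` is not installed, the standard `od` utility gives the same octets, and using `printf '%s'` instead of `echo` avoids the extra LINEFEED. A minimal sketch; `utf8_hex` is a hypothetical helper name, and the result reflects whatever encoding the string is actually stored in (UTF-8 is assumed here).

```shell
# utf8_hex: print the bytes of a string as space-separated hex octets.
# (Hypothetical helper.)
utf8_hex() {
    # od prints the octets in hex (-v: never abbreviate repeated lines);
    # unquoted word splitting then collapses od's padding and line breaks.
    set -- $(printf '%s' "$1" | od -An -v -tx1)
    echo "$*"
}

utf8_hex 'o'    # 6f
utf8_hex '˥'    # cb a5, assuming this text is stored as UTF-8
```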

## Ucodes used

TBD: formatting characters in the masterfiles

Last update: 14-10-2022