Masterfile format

A masterfile contains dictionary entries in a raw format from which other formats (e.g. HTML, EPUB, database) can be generated. There are two masterfiles for Melzian's dictionary: one for the BINI DICTIONARY and one for the LIST OF ADDENDA.

The masterfiles are UTF-8 text files wrapped between the HTML tags <pre> and </pre> and preceded by the HTML tag <meta charset="utf-8"/>. In the following, 'masterfile' refers to the text of the masterfile without these HTML additions.
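The HTML additions can be removed with a few lines of code. The following is a minimal sketch in Python; the function name strip_html_wrapper is hypothetical, and the tolerance for whitespace and line breaks around the tags is an assumption, since their exact placement in the files is not described here.

```python
import re


def strip_html_wrapper(raw: str) -> str:
    """Return the masterfile text without the HTML additions."""
    text = raw.strip()
    # Drop the leading <meta charset="utf-8"/> tag, if present.
    text = re.sub(r'^<meta charset="utf-8"\s*/>\s*', "", text)
    # Drop the opening <pre> and the closing </pre>, together with
    # the line break next to each of them (assumed, not documented).
    text = re.sub(r"^<pre>\r?\n?", "", text)
    text = re.sub(r"\r?\n?</pre>$", "", text)
    return text
```

When reading the file, open it with encoding="utf-8" and newline="" so that Python does not silently change the line endings.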

Each of the two masterfiles comes in two formats. In Linux format each line ends with a 'Line feed' (ASCII 0x0A). In DOS format each line ends with a 'Carriage Return' (ASCII 0x0D) followed by a 'Line feed' (ASCII 0x0A). Otherwise, the files are identical.
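Because the two formats differ only in their line endings, one can be derived from the other. A minimal sketch in Python, working on raw bytes; the function names are hypothetical and this is not necessarily how the distributed files were produced.

```python
def dos_to_linux(data: bytes) -> bytes:
    """Replace every CR+LF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def linux_to_dos(data: bytes) -> bytes:
    """Replace every LF with CR+LF."""
    # Normalise first so that existing CR+LF pairs are not doubled.
    return dos_to_linux(data).replace(b"\n", b"\r\n")
```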

There are 3388 entries in the BINI DICTIONARY and 82 in the LIST OF ADDENDA. Each entry is one line in the masterfile. The longest line in the masterfiles is 8966 bytes long; it is the entry for gbe 1. Some display devices have problems with text files that contain such long lines.
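The length of the longest line can be checked before a masterfile is opened in a tool that may struggle with long lines. A minimal sketch in Python; the function name is hypothetical, and the length is measured in UTF-8 bytes, not in characters.

```python
def longest_line_bytes(text: str) -> int:
    """Return the length in UTF-8 bytes of the longest line in text."""
    # Split on LF only and drop a trailing CR, so that both the
    # Linux and the DOS format give the same result.
    return max(
        len(line.rstrip("\r").encode("utf-8"))
        for line in text.split("\n")
    )
```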

The masterfiles contain characters that do not appear in the dictionary itself: '$', '#', '%', '|', '_', '+'. These characters carry formatting information.

Each page of the dictionary (pages xvii-xviii and 1-233 of the book) has two columns. The columns are identified by column-markers: '$Page-nnn-L$' for the left column and '$Page-nnn-R$' for the right column. Here 'nnn' is '000' or '001' for the LIST OF ADDENDA, which comprises only two pages (pages xvii-xviii of the book), and '001',...,'233' for the BINI DICTIONARY. A column-marker appears in the masterfile before the entries of its column. The first column-marker is the first line of each masterfile. Every other column-marker is appended to the last entry of the previous column.
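The column-markers follow a fixed pattern and can be located with a regular expression. A minimal sketch in Python; the names COLUMN_MARKER and column_markers are hypothetical.

```python
import re

# '$Page-nnn-L$' or '$Page-nnn-R$', where nnn is three digits.
COLUMN_MARKER = re.compile(r"\$Page-(\d{3})-([LR])\$")


def column_markers(text: str) -> list[tuple[str, str]]:
    """Return (page, side) for every column-marker, in file order."""
    return COLUMN_MARKER.findall(text)
```

The page number is kept as a string so that leading zeros such as '000' are preserved.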

Each entry in a masterfile starts with '# '. Thus each line in a masterfile except the first one starts with '# '. Sometimes a column in the dictionary starts in the middle of an entry, e.g. at '$Page-003-R$'. This is indicated with '#-'. This happens 0 times in the LIST OF ADDENDA and 307 times in the BINI DICTIONARY.
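The counts given in this document can be used as a sanity check on a parser. A minimal sketch in Python; the function names are hypothetical, and counting '#-' as a plain substring relies on the fact that '#' does not occur in the dictionary text itself.

```python
def entry_lines(text: str) -> list[str]:
    """Return the lines of a masterfile that hold an entry."""
    # Split on LF and drop a trailing CR to support both formats.
    lines = (line.rstrip("\r") for line in text.split("\n"))
    return [line for line in lines if line.startswith("# ")]


def count_mid_entry_columns(text: str) -> int:
    """Count the columns that start in the middle of an entry."""
    return text.count("#-")
```

For the BINI DICTIONARY, len(entry_lines(text)) should agree with the number of entries stated above, and count_mid_entry_columns(text) with the number of columns that start in the middle of an entry.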

The end of a column-line is indicated with ' %' (note the space). Sometimes a column-line ends with a hyphen; such line ends appear in a masterfile as '- %'. Most of the time the hyphen indicates word-splitting at the end of a line. But in many cases the hyphen is part of the word itself, both in English and in Edo, for example 'witch-doctor' in the entry for abã. These cases are indicated with '-- %'. This happens 2 times out of 5 in the LIST OF ADDENDA and 510 times out of 1705 in the BINI DICTIONARY. The decision whether the hyphen is part of the word or not has been made by examining each case individually.
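When an entry is converted to running text, the three kinds of line end can be resolved in one pass. A minimal sketch in Python; the function name is hypothetical, and it assumes that any whitespace directly after the '%' belongs to the line break and can be discarded.

```python
import re

# Optional '--' or '-' followed by ' %' and any trailing whitespace.
LINE_END = re.compile(r"(--|-)? %\s*")


def unwrap_column_lines(entry: str) -> str:
    """Join the column-lines of an entry into running text."""
    def join(match: re.Match) -> str:
        hyphen = match.group(1)
        if hyphen == "--":
            return "-"   # hyphen is part of the word: keep one hyphen
        if hyphen == "-":
            return ""    # hyphen only splits the word: rejoin the halves
        return " "       # plain line end: becomes a single space
    return LINE_END.sub(join, entry)
```

For example, unwrap_column_lines("witch-- % doctor") gives "witch-doctor", while unwrap_column_lines("medi- % cine") gives "medicine".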

Edo words are enclosed by '|'.

Words in italics are enclosed by '_'.
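These two markers map directly onto markup in a target format. A minimal sketch in Python for HTML output; the function names and the CSS class 'edo' are hypothetical, and the sketch assumes that the enclosing characters always come in pairs and are never nested.

```python
import html
import re

EDO_WORD = re.compile(r"\|([^|]*)\|")
ITALIC = re.compile(r"_([^_]*)_")


def edo_words(entry: str) -> list[str]:
    """Return the Edo words of an entry, in order of appearance."""
    return EDO_WORD.findall(entry)


def to_html(entry: str) -> str:
    """Convert the '|' and '_' markers of an entry to HTML tags."""
    # Escape '&', '<' and '>' first; '|' and '_' are left untouched.
    entry = html.escape(entry, quote=False)
    entry = EDO_WORD.sub(r'<span class="edo">\1</span>', entry)
    return ITALIC.sub(r"<i>\1</i>", entry)
```

The other formatting characters ('$', '#', '%', '+') are not handled here and would need their own conversion steps.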

The text of the dictionary has been digitized as it appears in the book, including possible errors. In some cases, however, errors have been corrected. To convert a masterfile into another format, the masterfile has to be parsed, and parsing is simpler when opening and closing parentheses, brackets and quotes match. Missing opening or closing parentheses, brackets or quotes have therefore been added. These modifications are marked with '+' and were made 0 times in the LIST OF ADDENDA and 29 times in the BINI DICTIONARY.

One has to compare the digitized text with the text of the dictionary to see exactly what modification was made. For example, '(+' can mean that an opening parenthesis has been added, but also that an opening bracket '[' which was matched by a closing parenthesis ')' has been changed into an opening parenthesis. A character that has been deleted is indicated with only '+'.
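To review the modifications against the book, it helps to list every '+' together with some surrounding text. A minimal sketch in Python; the function names and the default context width are hypothetical. Removing the '+' marks afterwards leaves the corrected text, on the assumption that '+' has no other use in the masterfiles.

```python
import re


def plus_contexts(text: str, width: int = 10) -> list[tuple[int, str]]:
    """Return (offset, snippet) for every '+' mark in text."""
    found = []
    for match in re.finditer(r"\+", text):
        start = max(0, match.start() - width)
        found.append((match.start(), text[start:match.end() + width]))
    return found


def strip_plus_marks(text: str) -> str:
    """Remove the '+' marks, keeping the modified text itself."""
    return text.replace("+", "")
```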


Last update: 29-01-2025