ONOMASTICA

From HLT@INESC-ID

Revision as of 23:04, 14 February 2006 by Root (talk | contribs)

ONOMASTICA (Linguistic Research and Engineering) - Multi-Language Pronunciation Dictionary of Proper Names and Place Names (1993-1995).

Summary

The ONOMASTICA project was a Europe-wide research initiative within the scope of the Linguistic Research and Engineering Programme, whose aim was the construction of a multi-language pronunciation lexicon of proper names. The project covered eleven European languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish. Eleven associated partners from telephone companies provided data files including names of persons, cities, towns, streets and companies. One of the main goals of the project was to derive pronunciation dictionaries for up to one million names per language in a semi-automatic way.

In general, grapheme-to-phoneme conversion systems perform much worse on proper names than on the common lexicon. This is not surprising, since in most languages names may obey morphological and phonological rules different from those of ordinary words. Part of the problem derives from the mobility of names: they move with people from one country to another, showing different degrees of adjustment to the sound structure of the language in which they surface. There are, however, other sources of difficulty. The orthography of last names can be rather conservative and, as it no longer conforms to the general orthographic rules, its phonetic interpretation is sometimes misleading. Furthermore, some applications require the ability to generate correct pronunciations for acronyms which, in some languages, can follow rules significantly different from those observed for the common lexicon.

Part of the work in this project was therefore aimed at upgrading existing rule engines to cope with the problems posed by proper names. A significant part of the work was also devoted to the development of self-learning grapheme-to-phoneme conversion methods and the comparison of their performance with that of rule-based methods. These self-learning approaches included both conventional backpropagation and self-organizing neural networks, as well as various symbolic learning techniques, ranging from analogy-based learning to table look-up. The table look-up approach, developed by CPK, was tested by most of the partners, allowing inter-language assessment of its performance.
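
To make the idea of table look-up concrete, the following is a minimal Python sketch of a greedy longest-match look-up over grapheme chunks. It is not the CPK method, whose details are not given here; the function name `transcribe`, the `max_chunk` parameter, and the contents of `TOY_TABLE` are all invented for illustration and are not linguistically accurate for any language.

```python
# Illustrative only: a generic longest-match table look-up, NOT the CPK method.
# TOY_TABLE is an invented mapping from grapheme chunks to phoneme symbols.
TOY_TABLE = {
    "sch": "S",
    "s": "s",
    "c": "k",
    "h": "h",
    "a": "a",
}


def transcribe(word, table, max_chunk=3):
    """Return a space-separated phoneme string for `word`.

    At each position the longest grapheme chunk found in `table`
    (up to `max_chunk` letters) is consumed. Letters with no table
    entry are emitted as "?".
    """
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for size in range(min(max_chunk, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                phones.append(table[chunk])
                i += size
                break
        else:
            # No chunk of any size matched at this position.
            phones.append("?")
            i += 1
    return " ".join(phones)
```

With this toy table, `transcribe("scha", TOY_TABLE)` consumes the three-letter chunk `sch` before falling back to single letters, which is the behaviour that distinguishes a context-sensitive table from a plain letter-to-sound mapping.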

The number of entries in the ONOMASTICA lexicon differs significantly from language to language, ranging from one hundred thousand to more than one million. All entries were automatically processed to provide broad phonetic transcriptions. A large percentage of these transcriptions was manually verified by at least one trained phonetician, who provided up to 5 alternative pronunciations for each entry, tagged with the corresponding category (first name, surname, company name, street name, town name, or region name) and, in some languages where this information was available, with its etymology and frequency of occurrence. Quality assurance measures played a key role throughout the project. Three quality bands were identified, depending on the certainty of the transcriptions (I: verified by a transcriber who is certain of its correctness; II: verified by a transcriber with some uncertainty; III: not verified). One thousand entries from each band were randomly selected and their correctness was judged by independent auditors for each language. Part of this ONOMASTICA pronunciation lexicon, which totals 8.5 million European names, is included in a CD-ROM that currently holds 25,000 band I entries from eight languages.
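
The fields described above can be pictured as a simple record. The Python sketch below is only an illustration of that description: the class and field names (`LexiconEntry`, `orthography`, `band`, and so on) are assumptions, not the actual ONOMASTICA file format, and it reads "up to 5 alternative pronunciations" as at most five transcriptions per entry in total.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    FIRST_NAME = "first name"
    SURNAME = "surname"
    COMPANY_NAME = "company name"
    STREET_NAME = "street name"
    TOWN_NAME = "town name"
    REGION_NAME = "region name"


class QualityBand(Enum):
    I = 1    # verified by a transcriber who is certain
    II = 2   # verified by a transcriber with some uncertainty
    III = 3  # not verified


# Assumption: the limit of 5 applies to the total number of transcriptions.
MAX_PRONUNCIATIONS = 5


@dataclass
class LexiconEntry:
    """Hypothetical record layout; field names are not from the project."""
    orthography: str
    language: str
    category: Category
    band: QualityBand
    pronunciations: List[str] = field(default_factory=list)
    etymology: Optional[str] = None   # only available for some languages
    frequency: Optional[int] = None   # only available for some languages

    def __post_init__(self):
        if not 1 <= len(self.pronunciations) <= MAX_PRONUNCIATIONS:
            raise ValueError(
                f"expected 1 to {MAX_PRONUNCIATIONS} pronunciations, "
                f"got {len(self.pronunciations)}"
            )
```

An entry would then be built as `LexiconEntry("Lisboa", "pt", Category.TOWN_NAME, QualityBand.I, ["<transcription>"])`, where the transcription string is a placeholder rather than real lexicon data.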

Another important goal of this project was to investigate the problems of exchanging national names amongst the partners to create a matrix lexicon of "nativised" pronunciations for each foreign name in each other language. Whereas the set of the 11 national pronunciation lexicons is directly suited to immediate exploitation, particularly in the development of telecommunications applications, the inter-language lexicon should be viewed rather as a research tool. It is limited to 1,000 names per language and therefore contains 11,000 entries, with eleven transcriptions each, amounting to a total of 121,000 transcriptions.
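
The size of the matrix lexicon follows directly from the figures above. As a quick check, the short Python sketch below reproduces the arithmetic; the function name `matrix_lexicon_size` is purely illustrative.

```python
LANGUAGES = [
    "Danish", "Dutch", "English", "French", "German", "Greek",
    "Italian", "Norwegian", "Portuguese", "Spanish", "Swedish",
]
NAMES_PER_LANGUAGE = 1000


def matrix_lexicon_size(n_languages, names_per_language):
    """Return (entries, transcriptions) for an inter-language matrix lexicon.

    Each language contributes `names_per_language` entries, and every
    entry is transcribed once per language (native plus nativised forms).
    """
    entries = n_languages * names_per_language
    transcriptions = entries * n_languages
    return entries, transcriptions


entries, transcriptions = matrix_lexicon_size(len(LANGUAGES), NAMES_PER_LANGUAGE)
```

Here `entries` is 11 x 1,000 = 11,000 and `transcriptions` is 11,000 x 11 = 121,000, matching the totals quoted in the text.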

The design criterion for this database was primarily to emphasize the potential use of this type of lexicon in multi-lingual speech recognition applications involving users in different European countries. With this in mind, the selected vocabulary was restricted to names of cities, towns, airports, stations, monuments and landmarks whose size, historical significance, and geographical importance (namely in terms of transport) justify their inclusion in tourist guidebooks. The targeted applications are those most likely to be used by non-native users, thus implying the recognition of considerably different pronunciations: travel information, flight booking, weather forecasting, road report systems, etc. One of the most interesting aspects of the work on the inter-language matrix lexicon consisted in defining "nativised" pronunciations. We have emphasized the factors influencing nativisation, and compared different degrees of adjustment to the sound structure of foreign languages.

An application programming interface (API) has been developed to provide a convenient method to access the data held on CD-ROM. Written in C, it can be used from either DOS or Windows, and offers the basic functions to open, search, read, and close a data file. A Visual Basic program has also been developed to demonstrate the use of the API calls.
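
The original API was written in C and its actual function names and file format are not documented here. Purely to illustrate the open, search, read, close call pattern, the following Python sketch operates on a hypothetical tab-separated text file with one record per line and the name in the first field; the class `LexiconFile` and all of its methods are assumptions made for this example.

```python
class LexiconFile:
    """Hypothetical stand-in for the open/search/read/close call pattern.

    Assumes a UTF-8 text file with one tab-separated record per line,
    the name being the first field. This is NOT the ONOMASTICA format.
    """

    def __init__(self):
        self._handle = None
        self._record = None

    def open(self, path):
        self._handle = open(path, "r", encoding="utf-8")

    def search(self, name):
        """Scan from the start of the file; return True if `name` is found."""
        self._handle.seek(0)
        for line in self._handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == name:
                self._record = fields
                return True
        self._record = None
        return False

    def read(self):
        """Return the fields of the record located by the last search, or None."""
        return self._record

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None
        self._record = None
```

A caller would open the file once, alternate `search` and `read` for each name of interest, and call `close` when finished. A linear scan keeps the sketch short; a real implementation for a large lexicon on CD-ROM would more plausibly rely on an index or a sorted file.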

Although the ONOMASTICA project ended in June 1995, the work continued at least until 1997 with the introduction of new partners, addressing names from Eastern and Central European languages (Czech, Estonian, Latvian, Polish, Romanian, Slovakian, Slovenian and Ukrainian) in a new project funded by the EC Copernicus Programme.

The Portuguese partner in ONOMASTICA was INESC, in the scope of its cooperation agreement with CLUL. The database of names of persons, streets, places and companies was provided by TLP (later Portugal Telecom), the associate Portuguese partner in this project. The main researchers involved in this team were Isabel Trancoso, on behalf of INESC, and M. Céu Viana, on behalf of CLUL. Useful contributions to this work were also made by Isabel Mascarenhas (CLUL) and Fernando M. Silva (Neural Networks Group of INESC).

Isabel Trancoso
Wed Aug 13 15:29:52 WET DST 1997