Lexicon extraction from bilingual comparable corpora
From HLT@INESC-ID
Date
- 16:00, Monday, February 08th, 2010
- Room 336
Speaker
- Luís Carvalho, L2F
Abstract
Parallel corpora are an expensive resource for Machine Translation systems. Since it was shown that patterns of word co-occurrence are preserved even across unrelated texts in different languages, non-parallel texts have been adopted in these systems as an alternative to parallel corpora. Comparable corpora are a specific type of non-parallel text with a high level of comparability, that is, the texts cover the same subject and have a similar time window and size. This type of corpus is preferred over parallel corpora not only because it is abundant, but also because it is easily accessible via the web. The objective of this work is to build a bilingual lexicon from a source language to a target language using comparable corpora. For that purpose, the system is composed of two modules. The first is responsible for detecting cognate words using different approaches, such as verbatim detection, rule-based detection, non-rule-based detection and sound-based detection; the potential equivalents are then extracted using similarity measures. The second module exploits a characteristic of comparable texts, context preservation across the corpora: the context of a given word in the source language tends to be similar to the context of its translation in the target language. For each word, co-occurrences with context words are counted and stored in a context vector, which is then compared with all target vectors using similarity measures. Combined, these modules may form an efficient platform for automatically finding translation equivalents between two languages when creating a bilingual lexicon.
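The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation, and everything in it is an assumption made for illustration: the function names, the toy Portuguese/English sentences, the use of `difflib.SequenceMatcher` as a stand-in string similarity for cognate detection, cosine as the vector similarity measure, and the small seed lexicon used to map source context words into the target language (the abstract does not say how the two vector spaces are aligned).

```python
from collections import Counter
from difflib import SequenceMatcher
import math


def cognate_score(src, tgt):
    """String similarity in [0, 1]; identical (verbatim) words score 1.0.
    SequenceMatcher is only a stand-in for the talk's similarity measures."""
    return SequenceMatcher(None, src.lower(), tgt.lower()).ratio()


def context_vector(word, sentences, window=2):
    """Count the words that co-occur with `word` within +/- `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts


def translate_vector(vec, seed):
    """Map source context words to target words via a (hypothetical) seed lexicon."""
    out = Counter()
    for w, c in vec.items():
        if w in seed:
            out[seed[w]] += c
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(word, src_sents, tgt_sents, seed, window=2):
    """Compare the source word's context vector with every target word's vector."""
    src_vec = translate_vector(context_vector(word, src_sents, window), seed)
    vocab = {t for s in tgt_sents for t in s}
    scored = [
        (cosine(src_vec, context_vector(t, tgt_sents, window)), t) for t in vocab
    ]
    return sorted(scored, reverse=True)


# Toy comparable (not parallel) data, invented for this sketch.
src_sents = [["o", "gato", "bebe", "leite"], ["o", "gato", "come", "peixe"]]
tgt_sents = [
    ["the", "cat", "drinks", "milk"],
    ["the", "cat", "eats", "fish"],
    ["the", "dog", "eats", "meat"],
]
seed = {"o": "the", "bebe": "drinks", "leite": "milk", "come": "eats", "peixe": "fish"}

ranking = rank_candidates("gato", src_sents, tgt_sents, seed)
print(ranking[:3])
print(cognate_score("familia", "family"))
```

In a combined system, the cognate score and the context similarity could be merged into a single ranking per source word; how the talk weights the two modules is not stated in the abstract.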