Word Alignments

From HLT@INESC-ID

Revision as of 15:45, 30 June 2008 by Javg
  Manually annotated word alignments for six language pairs:
  * Portuguese - English
  * Portuguese - French
  * Portuguese - Spanish
  * English - Spanish 
  * English - French 
  * French - Spanish

If you use the corpus, please cite the following paper:

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro. Building a golden collection of parallel Multi-Language Word Alignment. In the 6th International Conference on Language Resources and Evaluation (LREC 2008), May 2008.

Contents

The corpus is taken from the publicly available Europarl Corpus, which contains the proceedings of the European Parliament in its official languages. The golden collection is built over the first 100 sentences of the common test set, which is taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the Europarl archives and is already tokenized and lowercased.
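As a rough illustration of the selection described above, the following Python sketch enumerates the six language pairs and reads the first 100 sentences of a sentence-aligned test set. The file naming scheme (`test.<lang>`), the language codes, the one-sentence-per-line layout and the UTF-8 encoding are assumptions made for the example, not details of the actual Europarl distribution.

```python
from itertools import combinations, islice
from pathlib import Path

# Assumed language codes for Portuguese, English, French and Spanish.
LANGUAGES = ["pt", "en", "fr", "es"]


def language_pairs(languages=LANGUAGES):
    """Return every unordered pair of languages (six pairs for four languages)."""
    return list(combinations(languages, 2))


def first_sentences(path, n=100, encoding="utf-8"):
    """Read the first n lines of a one-sentence-per-line file.

    The encoding is an assumption; adjust it to match the files you have.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def load_parallel_subset(directory, pair, n=100):
    """Pair up the first n sentences of two language files.

    Files are assumed to be named test.<lang> (hypothetical naming) and to be
    aligned line by line, so sentence i in one file translates sentence i in
    the other.
    """
    source_lang, target_lang = pair
    source = first_sentences(Path(directory) / f"test.{source_lang}", n)
    target = first_sentences(Path(directory) / f"test.{target_lang}", n)
    if len(source) != len(target):
        raise ValueError("the two files do not have the same number of sentences")
    return list(zip(source, target))
```

Calling `load_parallel_subset("common-test", ("pt", "en"))` would then return up to 100 sentence pairs, where `common-test` is a hypothetical directory holding the unpacked test set.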

Guidelines

Guidelines followed to produce the manual word alignments for the six language pairs (all combinations of Portuguese, English, French and Spanish) (PDF).

Download

Golden collection of parallel multi-language word alignments

(more information on the Speech-to-speech Translation information page)