Manually annotated word alignments for six different language pairs.
Please cite the following paper if you use the corpus:
The corpus is taken from the publicly available Europarl Corpus, which contains the proceedings of the European Parliament in the different official languages. The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the Europarl archives. It is already tokenized and lowercased.
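As a rough illustration of the selection described above, the following sketch reads the first 100 sentences from one side of the common test set. It assumes the file is UTF-8 text with one sentence per line; the function name first_n_sentences and the file name in the usage note are hypothetical, not part of the corpus distribution.

```python
from itertools import islice


def first_n_sentences(path, n=100):
    """Return the first n lines of a text file, one sentence per line.

    Assumption: the test set stores one tokenized, lowercased sentence
    per line, so no further preprocessing is applied here.
    """
    with open(path, encoding="utf-8") as handle:
        # islice stops after n lines without reading the rest of the file.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, first_n_sentences("test2000.en") would return the English side of the 100 sentences, where "test2000.en" is a placeholder for whatever name the downloaded file has.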
Guidelines followed to produce the manual word alignments over six different language pairs (all combinations between Portuguese, English, French and Spanish) (PDF).
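To make the count of six explicit, this small sketch enumerates all unordered pairs of the four languages. The two-letter codes are an assumption made for illustration; the corpus files may label the languages differently.

```python
from itertools import combinations

# Assumed two-letter codes for Portuguese, English, French and Spanish.
LANGUAGES = ["pt", "en", "fr", "es"]

# Every unordered pair of distinct languages: C(4, 2) = 6 pairs.
LANGUAGE_PAIRS = list(combinations(LANGUAGES, 2))

for source, target in LANGUAGE_PAIRS:
    print(source + "-" + target)
```

Each pair is listed once, so direction (for example pt-en versus en-pt) is not distinguished in this sketch.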
(more information on the Speech-to-speech Translation information page)