Manually annotated word alignments for six different language pairs.
If you use the corpus, please cite the following paper:
João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, Building a golden collection of parallel Multi-Language Word Alignment, In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008
The corpus is taken from the publicly available Europarl Corpus, which contains the proceedings of the European Parliament in the different official languages. The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the Europarl archives. It is already tokenized and lowercased.
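Selecting the first 100 sentences can be sketched as follows. This is only an illustration: it assumes one sentence per line with whitespace-separated tokens, which the description above does not state explicitly. The function name `first_sentences` and the sample sentences are made up for the example.

```python
from itertools import islice


def first_sentences(lines, n=100):
    """Return the first n sentences as lists of tokens.

    Assumption: one sentence per line, tokens separated by whitespace
    (the text is described as already tokenized and lowercased).
    """
    return [line.split() for line in islice(lines, n)]


# Tiny in-memory stand-in for a test-set file (made-up sentences).
sample = [
    "resumption of the session\n",
    "i declare resumed the session .\n",
]
subset = first_sentences(sample, n=1)
print(subset)
```

With a real file, an open file object can be passed in place of `sample`, since the function only iterates over lines.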
The guidelines followed to produce the manual word alignments over the six language pairs (all combinations of Portuguese, English, French and Spanish) are available as a PDF.
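The count of six pairs follows from choosing 2 of the 4 languages without regard to order. A minimal sketch that enumerates them is shown below. The ordering within each pair is arbitrary here and is not meant to reflect how the corpus files are named.

```python
from itertools import combinations

languages = ["Portuguese", "English", "French", "Spanish"]

# Unordered pairs: C(4, 2) = 6
pairs = list(combinations(languages, 2))

for first, second in pairs:
    print(f"{first}-{second}")

print(len(pairs))
```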