Word Alignments: Difference between revisions

From HLT@INESC-ID

  Manually annotated word alignments for six different language pairs.
  * Portuguese - English
  * Portuguese - French
  * Portuguese - Spanish
  * English - Spanish
  * English - French
  * French - Spanish
Please cite the following paper if you use the corpus:
 
João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008
 


== Contents ==
The corpus is taken from the publicly available [http://www.statmt.org/europarl/ Europarl Corpus], which contains the proceedings of the European Parliament in its different official languages.
The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the [http://www.statmt.org/europarl/archives.html Europarl archives] and is already tokenized and lowercased.
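
As an illustration of how the 100-sentence subset could be extracted, here is a minimal Python sketch. It assumes that the common test set is stored as one tokenized sentence per line and that the files for two languages are line-aligned; the function names and any file names passed to them are hypothetical and not part of the distributed collection.

```python
from itertools import islice
from pathlib import Path


def read_first_sentences(path, limit=100):
    """Return the first `limit` lines of a one-sentence-per-line file as token lists."""
    with Path(path).open(encoding="utf-8") as handle:
        # The test set is described as already tokenized, so whitespace splitting suffices.
        return [line.split() for line in islice(handle, limit)]


def load_parallel_subset(source_path, target_path, limit=100):
    """Pair up the first `limit` sentences of two line-aligned files (assumed layout)."""
    source = read_first_sentences(source_path, limit)
    target = read_first_sentences(target_path, limit)
    if len(source) != len(target):
        raise ValueError("source and target files are not line-aligned")
    return list(zip(source, target))
```

Calling <code>load_parallel_subset</code> with the English and Portuguese test files would then give 100 sentence pairs, each a pair of token lists, which is the unit over which a word alignment is annotated.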


== Guidelines ==  
The guidelines followed to produce the manual word alignments for the six language pairs (all pairwise combinations of Portuguese, English, French and Spanish) are available as a [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4734.pdf PDF].
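
The count of six pairs follows from choosing 2 of the 4 languages. A small Python sketch that enumerates them (the variable names are illustrative only, and the order produced may differ from the order of the list on this page):

```python
from itertools import combinations

languages = ["Portuguese", "English", "French", "Spanish"]

# Every unordered pair of distinct languages: C(4, 2) = 6 pairs.
language_pairs = list(combinations(languages, 2))

for first, second in language_pairs:
    print(f"{first} - {second}")
```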


== Download ==
'''[http://www.l2f.inesc-id.pt/resources/translation/golden_collection.zip Golden collection of parallel multi-language word alignments]'''
(more information on the [[Speech-to-speech Translation]] page)

Latest revision as of 15:45, 30 June 2008
