1st TextLink Training School - Machine translation techniques to induce multilingual lexica of discourse markers
This page contains a basic description of the overall translation process and miscellaneous links and resources. If in doubt, please contact me.
The Motivation
The idea behind using translation techniques for building lexica of discourse markers in multiple languages arose from the need to perform discourse segmentation of texts and labeling of discourse relations and rhetorical status of discourse units. All of these tasks are hard and require tools informed with knowledge about the form and function of discourse units and devices, and they are further complicated by the fact that such resources may not exist for a specific target language.
In his PhD thesis, Daniel Marcu describes a shallow algorithm that, relying mostly on discourse connectives, achieves good results in the task of discourse segmentation. He is able to build discourse trees using a list of validated cue phrases and other features. From these trees, he is then able to infer logic structures (these will not be covered here -- at this stage we are only interested in finding lists of discourse markers).
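To make the idea concrete, here is a toy sketch of cue-phrase-based segmentation. This is a much-simplified illustration, not Marcu's actual algorithm: the cue list is invented for the example, and a real segmenter also uses punctuation and information about how each cue relates to unit boundaries.

```python
import re

# Illustrative cue list only; Marcu's validated list is far larger and each
# entry carries information about where the unit boundary falls.
CUE_PHRASES = ["because", "although", "however", "therefore"]

def segment(sentence, cues=CUE_PHRASES):
    """Naively split a sentence into discourse units at cue phrases.

    Starts a new unit at every cue occurrence; a real segmenter is
    considerably more careful about the function of each cue.
    """
    pattern = r"\b(" + "|".join(map(re.escape, cues)) + r")\b"
    units = []
    last = 0
    for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
        if m.start() > last:
            units.append(sentence[last:m.start()].strip())
        last = m.start()
    units.append(sentence[last:].strip())
    return [u for u in units if u]
```

For example, `segment("He left early because he was tired.")` yields the two units `["He left early", "because he was tired."]`.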
We base our strategy on the (not too unreasonable) assumption that discourse structuring by humans is a mental process that is realized linguistically in ways compatible with statistical machine translation (SMT) processes.
Schematically, we want a black box that takes a list of English discourse connectives and produces a corresponding list of candidate translations in a target language.
The Approach
The best way to avoid biased translations is to use large or very large corpora, preferably covering distinct domains and produced by different speakers/authors; this ensures variability and thus reduces bias.
One corpus that satisfies these requirements is the Europarl corpus, the European Parliament Proceedings Parallel Corpus 1996-2011. From the website: "The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek."
The Europarl corpus has been extensively used in other SMT works, so its behavior is well known and specific tools exist to process it and prepare it for training phrase alignment models for tools such as the ones described below.
In view of the above, our SMT box will consist of Europarl-derived phrase alignment models. To train these models, we are going to make use of three different sets of tools (some of which have been adapted to account for specific corpus issues):
- The Europarl tool set - these tools are specific to the corpus (the alignment tool has been adapted - see below); they will be used to prepare the corpus for the later stages.
- GIZA++ - a machine learning toolkit to train word alignment models (uses other tools, such as mkcls).
- Moses - this is a full-featured machine translation system able to coordinate other tools to produce translations based on the trained models; we use it to drive some of the other tools when training models.
- filter-pt - this is a contributed tool in the Moses toolkit that, using suffix arrays built by the SALM tool, drastically reduces the size of the phrase tables while maintaining translation quality.
After the models have been built and we have the compressed/pruned phrase tables, all that is left to do is extract the translations. This can be done in a variety of ways: directly using the phrase table (we will use this approach as a baseline) or using Moses to translate the input (this may not be the best approach, since we are not translating phrases). Also, we won't be using any language model of the target language, which means we will rely only on the contents of the phrase table.
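The baseline extraction can be sketched as a direct lookup in the phrase table. The sketch below assumes the plain-text Moses phrase-table format (`source ||| target ||| scores ||| ...`) and the standard four-score ordering, in which the third score is the direct phrase translation probability p(target|source); the function name and sample lines are ours, not part of any toolkit.

```python
from collections import defaultdict

def extract_candidates(phrase_table_lines, connectives, top_n=5):
    """Collect candidate translations for each connective straight from a
    Moses-style phrase table, ranked by p(target|source)."""
    wanted = set(connectives)
    candidates = defaultdict(list)
    for line in phrase_table_lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed lines
        source, target, scores = fields[0], fields[1], fields[2].split()
        if source in wanted and len(scores) >= 3:
            # scores[2]: direct phrase translation probability (assumed order)
            candidates[source].append((float(scores[2]), target))
    return {src: [t for _, t in sorted(pairs, reverse=True)[:top_n]]
            for src, pairs in candidates.items()}
```

Given a line such as `however ||| cependant ||| 0.2 0.1 0.6 0.3 ||| 0-0 ||| 5 6`, the function would return `cependant` as a candidate for `however`. In practice the candidate lists still need manual validation.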
Training the Translation Model
There are 7 steps in the training process. Step 6, the training proper, is a complex set of procedures that takes the prepared corpus and trains the phrase alignment model.
The following list presents the steps and the estimated times for the en-fr language pair (these times vary with the size of the corpus):
- Step 1: raw sentence alignment: 30+ minutes
- Step 2: tag removal: <1 minute
- Step 3: tokenization: 5 minutes
- Step 4: lowercasing: <1 minute
- Step 5: phrase chopping: <3 minutes
- Step 6: phrase alignment model training: 12 to 17 hours (small corpora ~3 hours).
- Step 7: indexing and pruning: ~90 minutes
Sentence Alignment
Tag Removal
Tokenization
Normalization
Cleaning
Training
See the detailed breakdown.