1st TextLink Training School - Machine translation techniques to induce multilingual lexica of discourse markers
Revision as of 18:53, 22 January 2016
This page contains a basic description of the overall translation process and miscellaneous links and resources. If in doubt, please contact me: david.matos@inesc-id.pt
The Motivation
The idea behind using translation techniques for building lexica of discourse markers in multiple languages arose from the need to perform discourse segmentation of texts and labeling of discourse relations and rhetorical status of discourse units. All of these tasks are hard and require tools informed with knowledge about the form and function of discourse units and devices; the task is further complicated by the fact that you may not have the resources for a specific target language.
In his PhD thesis, Daniel Marcu describes a shallow algorithm that, relying mostly on discourse connectives, achieves good results in the task of discourse segmentation. He is able to build discourse trees using a list of validated cue phrases and other features. From these trees, he is then able to infer logical structures (these will not be covered here -- at this stage we are only interested in finding lists of discourse markers).
We base our strategy on the (not too unreasonable) assumption that discourse structuring by humans is a mental process that is realized linguistically in ways compatible with statistical machine translation (SMT) processes.
Schematically, we want a black box that takes a list of English discourse connectives and produces a corresponding list of candidate translations in a target language.
The Approach
The best way to ensure that translations are not biased is to use large or very large corpora, preferably covering distinct domains and produced by different speakers/authors: this variability is what avoids bias.
One corpus that satisfies these requirements is the Europarl corpus, the European Parliament Proceedings Parallel Corpus 1996-2011 (http://www.statmt.org/europarl/). From the website: "The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek."
The Europarl corpus has been extensively used in other SMT work, so its behavior is well known, and specific tools exist to process it and prepare it for training phrase alignment models with tools such as the ones described below.
In view of the above, our SMT box will consist of Europarl-derived phrase alignment models. To train these models, we are going to make use of the following sets of tools (some of which have been adapted to account for specific corpus issues):
- The Europarl tool set (http://www.statmt.org/europarl/) - these tools are specific to the corpus (the alignment tool has been adapted - see below); they will be used to prepare the corpus for the later stages.
- GIZA++ (https://github.com/moses-smt/giza-pp) - a machine learning toolkit to train word alignment models (it uses other tools, such as mkcls).
- Moses (http://www.statmt.org/moses/) - a full-featured machine translation system able to coordinate other tools to produce translations based on the trained models; we use it to drive some of the other tools when training models.
- filter-pt (https://github.com/moses-smt/mosesdecoder/tree/master/contrib/sigtest-filter) - a contributed tool in the Moses toolkit that, using suffix arrays built by the salm tool (https://github.com/moses-smt/salm), drastically reduces the size of the phrase tables while maintaining translation quality.
After the models have been built and we have the compressed/pruned phrase tables, all that is left to do is extract the translations. This can be done in a variety of ways: directly using the phrase table (we will use this approach as a baseline) or using Moses to translate the input (this may not be the best approach, since our inputs are isolated phrases rather than full sentences). Also, we won't be using any language model of the target language, which means we will rely only on the contents of the phrase table.
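The baseline lookup can be sketched directly in the shell. The snippet below uses a tiny inline stand-in for the phrase table (the real input would be the pruned, gzip-compressed table produced in step 7), and the score positions follow the default Moses phrase-table layout; treat both as assumptions to check against your own tables.

```shell
# Tiny stand-in for a phrase table (Moses format: fields separated by
# " ||| "; the third field holds the translation scores).
cat > phrase-table.sample <<'EOF'
however ||| cependant ||| 0.6 0.5 0.7 0.4
however ||| toutefois ||| 0.3 0.2 0.4 0.3
moreover ||| de plus ||| 0.8 0.7 0.9 0.6
EOF

cue="however"

# Print candidate translations of the cue, ranked by the third score
# (the direct phrase probability in the default Moses layout).
awk -F' \\|\\|\\| ' -v cue="$cue" '
    $1 == cue { split($3, s, " "); print s[3], $2 }
' phrase-table.sample | sort -rn
```

On a real table, replace the inline sample with zcat on the pruned table from step 7; expect many noisy candidates, which is why the extraction scripts described later apply additional filtering.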
Training the Translation Model
There are 7 steps in the training process. Step 6, the training process proper, is a complex set of procedures that takes the prepared corpus and trains the phrase alignment model.
The following list presents the steps and the estimated times for the en-fr language pair (these time values vary according to the size of the corpus):
- Step 1: raw sentence alignment: 30+ minutes
- Step 2: tag removal: <1 minute
- Step 3: tokenization: 5 minutes
- Step 4: lowercasing: <1 minute
- Step 5: phrase chopping: <3 minutes
- Step 6: phrase alignment model training: 12 to 17 hours (~3 hours for small corpora)
- Step 7: indexing and pruning: ~90 minutes
The sandbox implements these steps in the build_pt.sh ("build phrase table") script. Since it is a very long-running script, it has been broken down into step sets (as shown below).
Step 1: Sentence Alignment
Sentence alignment must ensure that all sentences or similar units (e.g., paragraphs) are correctly aligned in each language pair. Furthermore, each sentence must be in a single line. In the specific case of the Europarl corpus, this means that a multi-line paragraph must sometimes be collapsed into a single line.
This step is implemented in build_pt_1_1.sh.
sentence-align-corpus.perl $EUROPARL_DIR $L1 $L2 $OUTPUT_DIR
cat $OUTPUT_DIR/$L1-$L2/$L1/* > europarl-step1-$L1-$L2.$L1
cat $OUTPUT_DIR/$L1-$L2/$L2/* > europarl-step1-$L1-$L2.$L2
($L1 is a language, e.g. "en"; $L2 is another language, e.g. "fr")
Alignment results for all language pairs (only against English, our input language) are available here. If these results are used, the original corpus is not necessary.
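As a concrete illustration of the "single line" requirement (this is not the Europarl alignment tool itself, just a sketch assuming blank-line-separated paragraphs), collapsing a multi-line paragraph into one line can be done with awk:

```shell
# Sample input: paragraphs separated by blank lines.
cat > paragraphs.sample <<'EOF'
This paragraph spans
two lines.

This one is short.
EOF

awk '
    NF   { line = line (line ? " " : "") $0; next }  # accumulate non-blank lines
    line { print line; line = "" }                   # blank line ends a paragraph
    END  { if (line) print line }                    # flush the last paragraph
' paragraphs.sample
```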
Step 2: Tag Removal
The Europarl corpus has XML tags for providing meta-information (e.g., for identifying speakers). However, these tags are not useful in our translation context and must be removed.
This step is implemented in build_pt_2_5.sh.
de-xml.perl europarl-step1-$L1-$L2 $L1 $L2 europarl-step2-$L1-$L2
Tag removal results for all language pairs (against English) are available here. If these results are used, the previous step is not necessary.
Step 3: Tokenization
The tokenization process isolates each token (words, punctuation, and so on), so that alignments are improved. Note that all punctuation and other marks are preserved, although in the specific case of discourse marker translation, they should probably be removed. You are invited to perform the experiment and compare the results (this requires adjusting the tokenizer provided by the Europarl toolset).
This step is implemented in build_pt_2_5.sh.
tokenizer.perl -l $L1 < europarl-step2-$L1-$L2.$L1 > europarl-step3-$L1-$L2.$L1
tokenizer.perl -l $L2 < europarl-step2-$L1-$L2.$L2 > europarl-step3-$L1-$L2.$L2
Tokenization results for all language pairs (against English) are available here. If these results are used, the previous steps are not necessary.
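The punctuation-removal experiment suggested above can also be approximated without touching the Europarl tokenizer: after tokenization, tokens are separated by single spaces, so punctuation-only tokens can simply be dropped. A minimal sketch:

```shell
# Drop punctuation-only tokens from already-tokenized text.
# (Illustrative post-processing, not the adjusted tokenizer itself.)
echo "however , this is not , in fact , the end ." |
tr ' ' '\n' |                           # one token per line
grep -v '^[[:punct:]][[:punct:]]*$' |   # drop punctuation-only tokens
paste -sd' ' -                          # rejoin into one line
```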
Step 4: Normalization
In this specific case, normalization is limited to converting every character to lowercase.
This step is implemented in build_pt_2_5.sh.
lowercase.pl < europarl-step3-$L1-$L2.$L1 > europarl-step4-$L1-$L2.$L1
lowercase.pl < europarl-step3-$L1-$L2.$L2 > europarl-step4-$L1-$L2.$L2
Normalization results for all language pairs (against English) are available here. If these results are used, the previous steps are not necessary.
Step 5: Cleaning
This step cleans the corpus while maintaining the alignments. The most important aspect is the removal of sentence pairs in which either side is too short or too long.
This step is implemented in build_pt_2_5.sh.
training/clean-corpus-n.perl europarl-step4-$L1-$L2 $L1 $L2 europarl-step5-$L1-$L2 $MIN_PHRASE_LENGTH $MAX_PHRASE_LENGTH
clean-corpus-n.perl is part of the training tool set provided with Moses. In our case $MIN_PHRASE_LENGTH is 0 (zero) and $MAX_PHRASE_LENGTH is 120. Other values could (and should) be experimented with.
Cleaning results for all language pairs (against English) are available here. Note that the indexing results produced by step 7 are also included in the results files. If these results are used, the previous steps are not necessary.
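The length filtering performed by clean-corpus-n.perl can be sketched as follows (the awk filter and file names are illustrative stand-ins, not the Moses code): a sentence pair survives only if both sides are within the bounds.

```shell
# Build a 2-pair sample corpus: one normal pair and one pair whose
# sides have 130 tokens each (over the 120-token maximum).
printf '%s\n' "this sentence is fine" "$(yes word | head -n 130 | paste -sd' ' -)" > sample.en
printf '%s\n' "cette phrase convient" "$(yes trop | head -n 130 | paste -sd' ' -)" > sample.fr

MIN=1     # the page uses 0; note that salm (step 7) additionally caps sentences at 254 words
MAX=120

# Keep a pair only if both sides are within [MIN, MAX] tokens.
paste sample.en sample.fr | awk -F'\t' -v min="$MIN" -v max="$MAX" '
{
    ne = split($1, a, " "); nf = split($2, b, " ")
    if (ne >= min && ne <= max && nf >= min && nf <= max) print
}'
```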
Step 6: Training
This is the translation model training step proper. It involves running GIZA++ and helper tools. Moses streamlines the process: see the detailed breakdown.
This step is implemented in build_pt_6_6.sh.
training/train-model.perl --root-dir $OUTPUT_DIR/$L1-$L2/europarl-step6-$L1-$L2 \
    --corpus europarl-step5-$L1-$L2 -f $L2 -e $L1 \
    --external-bin-dir /usr/bin --last-step 8 --parallel \
    --reordering wbe-msd-bidirectional-fe-allff --generation-factors 0-0
Training results (already including the pruned tables, produced by step 7) for all language pairs (against English) are available here. Avoid downloading these files (they are very large): they are not actually needed, unless you want to study them.
Raw phrase tables (the output of step 6) are available here. Note, however, that the pruned phrase tables (the output of step 7) are what interests us, since they are much smaller than the raw ones. They are available here.
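Both tables are plain gzip-compressed text in the Moses phrase-table format (source phrase, target phrase, and translation scores, separated by " ||| "), so their sizes are easy to compare without decompressing to disk. A small self-contained sketch (with real data you would point zcat at the step 6 and step 7 outputs):

```shell
# Create a tiny stand-in for a compressed phrase table; with real data
# zcat would read phrase-table.0-0.gz (step 6 output) or
# phrase-table.0-0.pruned.gz (step 7 output).
printf '%s\n' \
    'however ||| cependant ||| 0.5 0.4 0.7 0.6' \
    'however ||| toutefois ||| 0.2 0.1 0.3 0.2' \
    'however ||| mais ||| 0.01 0.01 0.02 0.01' | gzip > table.sample.gz

# Number of phrase pairs in the table -- comparing this count between
# the raw and pruned tables shows the effect of step 7.
zcat table.sample.gz | wc -l
```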
Step 7: Indexing and Pruning
This step is not strictly needed. However, it produces phrase tables that are just 10% of the size of the originals, without loss in translation quality, which is a strong incentive to use them.
Indexing is done using the salm tool, which builds a suffix array for each of the two languages. Note that the salm implementation does not support more than 254 words per sentence (this can be taken into account during the cleaning step).
This step is implemented in build_pt_7_7.sh.
IndexSA.O64 europarl-step5-$L1-$L2.$L1
IndexSA.O64 europarl-step5-$L1-$L2.$L2
zcat $OUTPUT_DIR/$L1-$L2/europarl-step6-$L1-$L2/model/phrase-table.0-0.gz | filter-pt -e europarl-step5-$L1-$L2.$L1 -f europarl-step5-$L1-$L2.$L2 -l a+e -n 30 | gzip -v9 > $OUTPUT_DIR/$L1-$L2/europarl-step6-$L1-$L2/model/phrase-table.0-0.pruned.gz
Indexing and pruning results for all language pairs (against English) are available here (complements step 5) and here (complements step 6).
Note that the pruned phrase tables are what interests us, since they are much smaller than the other outputs. They are available here.
The Sandbox
The sandbox is provided as a way to organize the preparation of the corpus, training of the models, and use of the produced phrase tables in the extraction of the translated markers for the provided input.
It has the following structure:
- cues.txt -- a list of English discourse cues (from Daniel Marcu's PhD thesis).
- build_pt.sh and friends -- the build scripts described above.
- align-corpus -- this directory contains helper scripts that complement the Europarl tools for corpus preparation.
- extract-translations -- this directory contains a baseline implementation of the discourse markers translator: it first selects candidates; these are filtered according to several criteria.
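The select-then-filter baseline can be sketched as below. The threshold, score positions, and file names are illustrative assumptions, not the actual extract-translations code: each cue's candidates are looked up in the phrase table, then filtered by a minimum direct translation probability.

```shell
# Sample cue list and phrase table (Moses " ||| " format).
cat > cues.sample <<'EOF'
however
moreover
EOF
cat > table.sample <<'EOF'
however ||| cependant ||| 0.5 0.4 0.7 0.6
however ||| le chat ||| 0.1 0.1 0.01 0.1
moreover ||| de plus ||| 0.6 0.5 0.8 0.7
EOF

# Stage 1: select candidates per cue; stage 2: filter by a minimum
# direct translation probability (third score, hypothetical 0.05 cutoff).
while read -r cue; do
    awk -F' \\|\\|\\| ' -v cue="$cue" -v minp=0.05 '
        $1 == cue { split($3, s, " "); if (s[3] + 0 >= minp) print cue " -> " $2 }
    ' table.sample
done < cues.sample
```

Real tables yield far noisier candidate sets, so the actual scripts combine several such criteria rather than a single probability cutoff.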
The sandbox can be used as long as the software needed by its scripts is present on the machine where it is used. If some (or all) of the tools are not present, then the output files described above must be used instead. The extraction scripts can be run as long as the phrase tables are present in the sandbox.
The Virtual Machine
An openSUSE (Linux) virtual machine containing all the needed software for preparing the Europarl corpus (including the corpus itself) and for training the alignment models has been prepared.
It is available here: https://susestudio.com/a/sD7EYX/textlink
A compatible repository containing packages for openSUSE Leap 42.1 is available here: http://download.opensuse.org/repositories/home:/inescid:/language:/textlink/openSUSE_Leap_42.1/
The virtual machine does not include the sandbox.
Other Materials
The full list of provided materials is available here: https://www.l2f.inesc-id.pt/~david/textlink/
Short link to this page: https://goo.gl/3jicxV