1st TextLink Training School - Machine translation techniques to induce multilingual lexica of discourse markers
Revision as of 09:11, 22 January 2016
This page contains a basic description of the overall translation process and miscellaneous links and resources. If in doubt, please contact me.
== The Motivation ==
The idea behind using translation techniques for building lexica of discourse markers in multiple languages arose from the need to segment texts into discourse units and to label the discourse relations and rhetorical status of those units. These tasks are hard: they require tools informed with knowledge about the form and function of discourse units and devices, and such resources may simply not exist for a specific target language.
In his PhD thesis, Daniel Marcu describes a shallow algorithm that, relying mostly on discourse connectives, achieves good results in the task of discourse segmentation. Using a list of validated cue phrases and other features, he builds discourse trees; from these trees he is then able to infer logical structures (these will not be covered here -- at this stage we are only interested in finding lists of discourse markers).
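To give a flavor of cue-phrase-based segmentation, here is a minimal, hypothetical sketch (not Marcu's actual algorithm, and with a toy cue list in place of a validated lexicon) that splits text into candidate discourse units at connectives:

```python
import re

# Toy cue-phrase list; a real lexicon (the goal of this page) would be much larger.
CUE_PHRASES = ["because", "although", "however", "therefore"]

def segment(text):
    """Split text into candidate discourse units at cue phrases (illustrative only).

    Returns a list of (cue, unit_text) pairs; the first unit has no leading cue.
    """
    pattern = r"\b(" + "|".join(re.escape(c) for c in CUE_PHRASES) + r")\b"
    # re.split with a capturing group keeps the matched cues in the result list.
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    units = []
    for i in range(0, len(parts), 2):
        cue = parts[i - 1] if i > 0 else None
        unit = parts[i].strip()
        if unit or cue:
            units.append((cue, unit))
    return units

print(segment("We stayed home because it rained, although the sun came out later."))
# → [(None, 'We stayed home'), ('because', 'it rained,'),
#    ('although', 'the sun came out later.')]
```

This only finds unit boundaries; deciding which relation a cue signals, and attaching units into a tree, is the hard part that needs a richer lexicon.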
We base our strategy on the (not too unreasonable) assumption that discourse structuring by humans is a mental process that is realized linguistically in ways compatible with statistical machine translation (SMT) processes.
Schematically, we want a black box that takes a list of English discourse connectives and produces a corresponding list of candidate translations in a target language.

[[image:textlink-1sts2016-blackbox.png|400px]]
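The desired interface of that black box can be sketched as follows. The lexicon below is hand-written purely for illustration; in the real approach the mapping is induced automatically from parallel corpora:

```python
# Illustrative only: the actual mapping is induced from parallel data, not hard-coded.
TOY_LEXICON = {
    ("however", "pt"): ["contudo", "no entanto", "porém"],
    ("because", "pt"): ["porque", "pois"],
}

def translate_connectives(connectives, target_lang):
    """Black-box interface: English connectives -> candidate target-language translations."""
    return {c: TOY_LEXICON.get((c, target_lang), []) for c in connectives}

print(translate_connectives(["however", "because"], "pt"))
# → {'however': ['contudo', 'no entanto', 'porém'], 'because': ['porque', 'pois']}
```

Note that the output is a list of *candidates* per connective: ranking and validating them is part of the process described in the rest of this page.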
== The Approach ==
The best way to avoid biased translations is to use large or very large corpora, preferably covering distinct domains and produced by many different speakers/authors, so that the data is varied enough not to favor any particular genre or style.
One corpus that satisfies these requirements is the Europarl corpus, the European Parliament Proceedings Parallel Corpus 1996-2011. From the website: "The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek."
The Europarl corpus has been used extensively in other SMT work, so its behavior is well known, and specific tools exist to process it and prepare it for training phrase alignment models with tools such as the ones described below.
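The Europarl releases ship each language pair as two line-aligned plain-text files (e.g. `europarl-v7.pt-en.en` and `europarl-v7.pt-en.pt`, where line *n* of one file translates line *n* of the other). A minimal sketch for reading such a pair into sentence pairs, assuming this line-aligned format:

```python
from itertools import islice

def read_parallel(src_path, tgt_path, limit=None):
    """Yield aligned (source, target) sentence pairs from two line-aligned files.

    Assumes the Europarl convention: the n-th line of each file is a translation pair.
    `limit` optionally caps how many pairs are read (the full corpus is large).
    """
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip(src, tgt)
        if limit is not None:
            pairs = islice(pairs, limit)
        for s, t in pairs:
            yield s.strip(), t.strip()
```

Pairs produced this way are the input expected by the alignment tools discussed below.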