Extracting Parallel Data from Microblog Messages

Wang Ling

Date

15:00, Friday, January 4th, 2013
Room 336

Speaker

Wang Ling

Abstract

We present a novel method for extracting parallel data from microblog messages. In contrast with previously described methods that detect parallel documents, our approach finds parallel segments within the same document. We demonstrate our technique’s applicability by extracting a large number of parallel Chinese-English sentence pairs from Sina Weibo, the Chinese counterpart of Twitter. We evaluate the quality of our automatic method using a corpus of hand-labeled examples. Used in a Chinese-English machine translation system, the automatically extracted parallel text yields substantial improvements on microblog message translation, more than doubling the baseline BLEU score relative to a system that uses existing parallel data resources.

From HLT@INESC-ID
