Extracting Parallel Data from Microblog Messages

From HLT@INESC-ID

Date

  • 15:00, Friday, January 4th, 2013
  • Room 336

Speaker

  • Wang Ling

Abstract

We present a novel method for extracting parallel data from microblog messages. In contrast with previously described methods that detect parallel documents, our approach finds parallel segments within the same document. We demonstrate our technique’s applicability by extracting a large number of parallel Chinese-English sentence pairs from Sina Weibo, the Chinese counterpart of Twitter. We evaluate the quality of our automatic method using a corpus of hand-labeled examples. Used in a Chinese-English machine translation system, the automatically extracted parallel text yields substantial improvements on microblog message translation, more than doubling the BLEU score of a baseline system that uses only existing parallel data resources.