Extracting Parallel Data from Microblog Messages

From HLT@INESC-ID

Wang Ling

Date

  • 15:00, Friday, January 4th, 2013
  • Room 336

Speaker

Abstract

We present a novel method for extracting parallel data from microblog messages. In contrast with previously described methods that detect parallel documents, our approach finds parallel segments within the same document. We demonstrate our technique’s applicability by extracting a large number of parallel Chinese-English sentence pairs from Sina Weibo, the Chinese counterpart of Twitter. We evaluate the quality of our automatic method using a corpus of hand-labeled examples. Used in a Chinese-English machine translation system, the automatically extracted parallel text yields substantial improvements on microblog message translation, more than doubling the baseline BLEU score relative to a system that uses existing parallel data resources.