Extracting Parallel Data from Microblog Messages


Wang Ling


  • 15:00, Friday, January 4th, 2013
  • Room 336



We present a novel method for extracting parallel data from microblog messages. In contrast to previously described methods, which detect parallel documents, our approach finds parallel segments within the same document. We demonstrate our technique’s applicability by extracting a large number of parallel Chinese-English sentence pairs from Sina Weibo, the Chinese counterpart of Twitter. We evaluate the quality of our automatic method using a corpus of hand-labeled examples. Used in a Chinese-English machine translation system, the automatically extracted parallel text yields substantial improvements on microblog message translation, more than doubling the BLEU score of a baseline system that uses existing parallel data resources.