Difference between revisions of "Machine Translation for Microblogs"

From HLT@INESC-ID

Line 11: Line 11:
  
 
PI: [[Isabel Trancoso]]
* [[João Paulo Carvalho]]
 
 
* [[Wang Ling]]
 
* [[Anabela Barreiro]]
Line 26: Line 25:
  
 
== Summary ==
The MT4M project develops machine translation systems for content in microblogs, such as Twitter. This domain is characterized by creative use of language, dialectal lexemes, and informal register, which challenge traditional systems. For example, Google's English-Portuguese translation system translates the English sentence "ill cook it brotha!" (an informal variant of "I'll cook it, brother!", which the same system translates correctly) into the unintelligible "doente cozinhar brotha!" (roughly: "sick to cook brotha!"). The work on this project involves developing a tweet normalizer capable of converting non-standard text into standard text while preserving the meaning of the original tweet.
+
The MT4M project develops machine translation systems for content in microblogs, such as Twitter. This domain is characterized by creative use of language, dialectal lexemes, and informal register, which challenge traditional systems. Our earlier work towards this goal exploited the fact that parallel data can be found in microblogs in order to build a normalization model. Our recent work addresses the lexical sparsity that characterizes this domain by proposing character-based word representation models that exploit orthographic properties of the language. The advantages of these models go beyond machine translation, generalizing to several other NLP tasks.

Revision as of 08:54, 14 January 2016


Sponsored by: FCT (CMUP-EPB/TIC/0026/2013)
Start: January 2015
End: December 2015

INESC-ID Team

PI: Isabel Trancoso

Unbabel Team

Carnegie Mellon University Team

Summary

The MT4M project develops machine translation systems for content in microblogs, such as Twitter. This domain is characterized by creative use of language, dialectal lexemes, and informal register, which challenge traditional systems. Our earlier work towards this goal exploited the fact that parallel data can be found in microblogs in order to build a normalization model. Our recent work addresses the lexical sparsity that characterizes this domain by proposing character-based word representation models that exploit orthographic properties of the language. The advantages of these models go beyond machine translation, generalizing to several other NLP tasks.