Parallel Morphosyntactic Annotation

From HLT@INESC-ID

Date

  • 15:00, April 11, 2008
  • Room 336

Speaker

Tiago Luís

Abstract

Morphosyntactic annotation often deals with large collections of data that demand significant resources, both in storage space and in processing time. Storage costs have declined, which has allowed collections to grow in size and in richness of description. This, in turn, has increased the amount of data to process. Despite cheaper storage, processing these materials remains computation-heavy and can take weeks to produce results. Parallel computing lets us solve such problems in less time, but it raises other issues: scheduling the execution of the program across machines, communication and synchronization between them, and fault tolerance. The Hadoop framework is a platform that simplifies the creation and execution of applications that process vast amounts of data. The computation is divided into many small blocks of work that are processed where the data is located, minimizing network traffic. I will present a framework that uses Hadoop to simplify the integration of NLP tools and their parallel execution. This framework produces language-independent annotations in MAF (Morphosyntactic Annotation Framework), a format developed by ISO TC37 SC4 that preserves all the information produced by the different tools.
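
To make the "many small blocks of work" idea concrete, here is a minimal sketch in plain Python of a map/reduce-style annotation pass. It is not the framework presented in the talk and does not use Hadoop: the toy lexicon, the function names, and the tag set are all hypothetical, and the blocks are processed sequentially in one process, whereas Hadoop would schedule them on the machines that hold the data.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a real morphological analyser.
TOY_LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "cats": "NOUN",
    "sleeps": "VERB",
    "sleep": "VERB",
}


def split_into_blocks(sentences, block_size):
    """Divide the corpus into small, independent blocks of work."""
    return [sentences[i:i + block_size]
            for i in range(0, len(sentences), block_size)]


def map_block(block):
    """Map step: tag every token in one block, emitting (token, tag) pairs."""
    pairs = []
    for sentence in block:
        for token in sentence.lower().split():
            pairs.append((token, TOY_LEXICON.get(token, "UNK")))
    return pairs


def reduce_pairs(pairs):
    """Reduce step: group by token and count how often each tag was assigned."""
    counts = defaultdict(lambda: defaultdict(int))
    for token, tag in pairs:
        counts[token][tag] += 1
    return {token: dict(tags) for token, tags in counts.items()}


def annotate(sentences, block_size=2):
    """Run the map step on each block, then merge everything in one reduce."""
    all_pairs = []
    for block in split_into_blocks(sentences, block_size):
        # Each block depends on no other block, so in a real cluster
        # these iterations could run on different machines.
        all_pairs.extend(map_block(block))
    return reduce_pairs(all_pairs)


if __name__ == "__main__":
    corpus = ["The cat sleeps", "The cats sleep", "A dog"]
    for token, tags in sorted(annotate(corpus).items()):
        print(token, tags)
```

Because each block is annotated independently, the merged result does not depend on how the corpus is split. That property is what lets a scheduler distribute blocks freely and rerun a failed block without affecting the others.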