Natural language processing (NLP) is a subfield of artificial intelligence and linguistics that studies the problems inherent to the processing of natural language. This area deals with large collections of data that require significant resources, both in terms of space and processing time. Currently, persistent space costs have declined, allowing on the one hand, growth and wealth of description of the data, and on the other hand, increase of the amount of data to process. Despite the fall of persistent storage costs, the processing of these materials is a computation-heavy process and some algorithms continue to take weeks to produce their results.
One of the computation-heavy linguistic tasks is annotation. Annotation is the process of adding linguistic information to language data or the linguistic annotation itself. Moreover, when these tools are integrated, several problems related with information flow between these tools may arise. For example, a given tool may need an annotation previously produced by another tool but some of this linguistic information can be lost in conversions between the different tool data formats due to differences in expressiveness.
The developed framework simplifies the integration of independently existing NLP tools to form NLP systems without information losses between them. Also, it allows the development of scalable and language-independent NLP systems on top of the Hadoop framework, offering a friendly programming environment and a transparent handling of distributed computing problems, like fault tolerance and task scheduling. With this framework we achieved speedup values around 40 on a cluster with 80 cores.