Computational grids enable the sharing and aggregation of geographically distributed resources for solving large-scale, data-intensive problems. In this project we propose the implementation of a software architecture for building component-based applications for the computational processing of the Portuguese language that can be executed on a computational grid.
The first task of the project is the definition of an architectural model for building grid-enabled component-based natural language engineering applications. The idea is to create a flexible application design environment, allowing disparate components to be combined to suit the overall application functionality.
Managing multi-component architectures is a challenging task and their complexity can be a serious hurdle when trying to bring together heterogeneous components. This problem occurs at various levels: from file-format handling or network-level communication to interaction between modules in a large application. The adopted model will be based on the work developed for the Galinha system (Matos, 2003) and on the architecture proposed by Hughes and Bird (2003).
The second task is the encapsulation of the re-usable components to comply with the requirements of the model defined in the previous task. Language engineering applications are constructed out of several processing components, each responsible for a specialized task. Typical components perform phone segmentation, tagging, lexicon access, parsing, etc. These components are usually highly parameterized and some of them have to be trained on very large datasets. Discovering optimal parameterizations is both data- and computation-intensive. In this task a set of pre-existing components will be adapted to the proposed framework.
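As an illustration of what encapsulation could look like, the following minimal sketch wraps a pre-existing processing function behind a uniform interface together with its parameters. The class name WrappedComponent, the process method, and the toy tokenizer are hypothetical names introduced here for illustration; they are not part of the Galinha system or of the model to be defined in the first task.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class WrappedComponent:
    """Hypothetical wrapper giving a pre-existing tool a uniform interface."""
    name: str
    func: Callable[..., str]  # the pre-existing processing function
    params: Dict[str, Any] = field(default_factory=dict)

    def process(self, text: str) -> str:
        # Every component exposes the same call, whatever it does internally.
        return self.func(text, **self.params)


def toy_tokenizer(text: str, lowercase: bool = False) -> str:
    # Stand-in for a real component: splits on whitespace only.
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return " ".join(tokens)


# The parameterization travels with the component, so the same tool can be
# registered several times with different settings.
tokenizer = WrappedComponent("tokenizer", toy_tokenizer, {"lowercase": True})
result = tokenizer.process("O  Gato dorme")
```

Keeping the parameters inside the wrapper means that a search over parameterizations reduces to creating many wrapped instances, each of which could in principle be scheduled as an independent job.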
The third task deals with the description and design of a multi-component application. Given a set of wrapped components, we need to combine them to create a range of multi-component applications that can be executed over the distributed GRID infrastructure. A portal-based approach will be used for application assembly.
The fourth task is the development of the interface to GRID services. We will adopt the approach provided by Globus (Globus Project, 2003). The Globus model identifies a distributed set of resources (the GRID) that the applications can use. We will also use a resource broker to manage the GRID interaction. The resource broker is responsible for the matchmaking between job requests and the distributed resources provided through the gatekeepers.
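The matchmaking role of the resource broker can be sketched as follows. This is a simplified illustration under stated assumptions, not the Globus API: the Resource and JobRequest types, their fields, and the selection policy (prefer the suitable resource with the most free CPUs) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical description of a resource advertised by a gatekeeper."""
    name: str
    free_cpus: int
    free_memory_mb: int


@dataclass
class JobRequest:
    """Hypothetical job request with its minimum requirements."""
    job_id: str
    cpus: int
    memory_mb: int


def match(job: JobRequest, resources: List[Resource]) -> Optional[Resource]:
    """Return the suitable resource with the most free CPUs, or None."""
    suitable = [
        r for r in resources
        if r.free_cpus >= job.cpus and r.free_memory_mb >= job.memory_mb
    ]
    if not suitable:
        return None
    return max(suitable, key=lambda r: r.free_cpus)


resources = [
    Resource("site-a", free_cpus=4, free_memory_mb=2048),
    Resource("site-b", free_cpus=16, free_memory_mb=8192),
    Resource("site-c", free_cpus=32, free_memory_mb=1024),
]
job = JobRequest("tagger-training", cpus=8, memory_mb=4096)
chosen = match(job, resources)
```

A real broker would also consider factors such as data location, queue length, and access rights; the sketch only shows the basic filter-then-rank step.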
The final task is the evaluation of the proposed solution. A set of component-based applications will be selected and used in experiments to evaluate the performance of the resulting system.
The main objective of this project is to create a framework for high-performance natural language engineering (NLE) computing on a computational GRID by extending the structure of the Galinha system (Matos, 2003).
The Galinha system was developed by the Spoken Language Laboratory (L²F) of INESC-ID in an effort to simplify the creation of NLE applications: an application is built through a web interface by creating a service chain from a pool of re-usable components. The current version includes components for morphological analysis, part-of-speech disambiguation, and syntactic analysis, among others. In this project we plan to also include modules for speech processing tasks.
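The idea of building a service chain from a pool of components can be sketched in a few lines. This is only an illustration and does not reflect the actual Galinha implementation: the pool contents, the component names, and the build_chain function are hypothetical, and each component is assumed to map text to text.

```python
from typing import Callable, Dict, List

# A component is assumed here to be any function from text to text.
Component = Callable[[str], str]

# Hypothetical pool of re-usable components, indexed by name.
POOL: Dict[str, Component] = {
    "normalize": lambda text: " ".join(text.split()),
    "lowercase": lambda text: text.lower(),
    "tokenize": lambda text: text.replace(",", " ,"),
}


def build_chain(names: List[str]) -> Component:
    """Compose the named components, in order, into a single callable."""
    steps = [POOL[name] for name in names]  # raises KeyError on unknown names

    def run(text: str) -> str:
        # The output of each step is the input of the next one.
        for step in steps:
            text = step(text)
        return text

    return run


chain = build_chain(["normalize", "lowercase", "tokenize"])
output = chain("Bom  dia, Lisboa")
```

Because the chain is described only by an ordered list of component names, the same description could be produced by a web interface and executed wherever the named components are available.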
In this project we will extend the Galinha architecture to include an interface to GRID services so that the components and data can be geographically and organizationally distributed. This requires a set of standard middleware protocols like the ones provided by the Globus toolkit to handle security, information discovery, resource and data management.