Difference between revisions of "NLE GRID (Natural Language Engineering on a Computational Grid)"

From HLT@INESC-ID

 
 
== Summary ==
 
Computational grids enable the sharing and aggregation of geographically distributed resources for solving large-scale, data-intensive problems. In this project we propose the implementation of a software architecture for building component-based applications for the computational processing of the Portuguese language that can be executed on a computational grid.
  
The first task of the project is the definition of an architectural model for building grid-enabled component-based natural language engineering applications. The idea is to create a flexible application design environment, allowing disparate components to be combined to suit the overall application functionality.  
  
 
Managing multi-component architectures is a challenging task and their complexity can be a serious hurdle when trying to bring together heterogeneous components. This problem occurs at various levels: from file-format handling or network-level communication to interaction between modules in a large application. The adopted model will be based on the work developed for the Galinha system (Matos, 2003) and on the architecture proposed by Hughes and Bird (2003).  
 
 
The fourth task is the development of the interface to GRID services. We will adopt the approach provided by Globus (Globus Project, 2003). The Globus model identifies a distributed set of resources (the GRID) that the applications can use. We will also use a resource broker to manage the GRID interaction. The resource broker is responsible for the matchmaking between job requests and the distributed resources provided through the gatekeepers.  
 
  
<!--[[Image:GridLoad.png|right]]-->
 
The final task is the evaluation of the proposed solution. A set of component-based applications will be selected to be used in experiments to evaluate the performance of the resulting system.
 
 
In this project we will extend the Galinha architecture to include an interface to GRID services so that the components and data can be geographically and organizationally distributed. This requires a set of standard middleware protocols like the ones provided by the Globus toolkit to handle security, information discovery, resource and data management.
 
  
== State of the Art ==
* [[NLE GRID Project - State of the Art and References|State of the Art]]

The use of distributed computing services in NLE is still at an early stage compared to what has been achieved in areas like high energy physics and biology. In our view this is due to the lack of standardization and interoperability of most NLE tools.

Research laboratories like ours, which use a considerable number of NLE tools and modules, often face the problem of re-using these resources. These may have been produced in-house or they may be third-party modules. In either case, the task of managing them is not simple: for instance, a tool may be available but may be deemed too hard to reuse for a particular task, causing the redevelopment of a similar tool.

If reuse is a problem, the contact between old tools and new users is also a critical issue. The problem here is often the time required to acquire the expertise needed to use a resource fully and productively.

To address the above issues, Matos (2003) proposed the Galinha system, a web-based user interface for building modular applications. The interface allows new users and non-specialists to assemble and test complex prototypes: the only requirement is a clear understanding of the meaning of the data used by each module, a requirement much less stringent than understanding the modules themselves.

The infrastructure used to support the interface is a partial implementation of the theoretical interconnection model proposed in Matos (2002). In the first stage, the Galaxy Communicator system (MIT, 2001) was selected to provide messaging support for the infrastructure's message exchanges.

A similar solution was proposed by Curran (2003), using a Generative Programming approach for the development of NLE applications by the composition of elementary components like sentence boundary detectors, POS taggers, chunkers and named entity recognizers. These re-usable components can be optimized for both performance and high runtime efficiency. They are encapsulated with standard interfaces for gluing them together into new tools. Curran also suggests the use of a web services interface to allow the composition of components developed by different researchers running in different locations.

Hughes and Bird (2003) proposed the extension of the component-based architecture to integrate interfaces with computational GRID services. In this project we plan to build on that proposal and to integrate it into the Galinha system.

A computational GRID allows for large-scale analysis, distributed resources and processing, in addition to engendering new models for collaboration and application development. Foster et al. (2001, 2002) provide a physiological and an anatomical overview of GRID computing services, as well as foundational architectures for application development in the GRID space.

To benefit from the use of a computational GRID, NLE applications need to subscribe to an architectural model that allows automated discovery of components and data, a flexible way to incorporate the different components in a working application, coordination of execution and storage of results. The goal is to allow NLE researchers to design their applications for a computational GRID without requiring expertise in GRID computing.
  
 
== Publications ==
 
 
* Ricardo Daniel Ribeiro, David Martins de Matos, Bruno Oliveira, Carlos Pona, Luísa Coheur, Creating and Maintaining Multi-purpose Lexical Knowledge, July 2006
 
  
== References ==
  
* J. Curran. Blueprint for a High Performance NLP Infrastructure. Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems, Edmonton, Canada pp. 39-44. ACL. 2003.
 
* H. Cunningham, Y. Wilks, and R. J. Gaizauskas. GATE - a General Architecture for Text Engineering. In Proc. of the 16th Conf. on Computational Linguistics (COLING96), Copenhagen, 1996.
 
* Rajkumar Buyya, David Abramson and Jonathan Giddy. NimrodG Resource Broker for Service Oriented Grid Computing. IEEE Distributed Systems Online, Vol. 2, N. 7, November 2001.
 
* Ian Foster, Carl Kesselman, J Nick, Steven Tuecke. The Physiology of the Grid - An Open Grid Services Architecture for Distributed Systems Integration. Global Grid Forum, June 22, 2002.
 
* Ian Foster, Carl Kesselman, Steven Tuecke. The anatomy of the grid: Enabling scalable virtual organisations. International Journal of Supercomputer Applications, 15(3), 2001
 
* Globus Project. The Globus Project. University of Chicago
 
* David Graff. English Gigaword. Linguistic Data Consortium, 2002
 
* Baden Hughes and Steven Bird. A Grid Based Architecture for High-Performance NLP. Natural Language Engineering, 2003.
 
* Ricardo Ribeiro, David M. de Matos, and Nuno Mamede. How to integrate data from different sources. Proc. of the INTERA Workshop "A Registry of Linguistic Data Categories within an Integrated Language Resources Repository Area". ELRA. May 2004.
 
* F. Batista and Nuno Mamede. Flexible Module for Shallow Parsing, Using Preferences. TASHA'2003 - Workshop on Tagging and Shallow Processing of Portuguese, Oct. 2003.
 
* Ricardo Ribeiro and Nuno Mamede and Isabel Trancoso. Reusing Linguistic Resources: a Case Study in Morphosyntactic Tagging. TASHA'2003 - Workshop on Tagging and Shallow Processing of Portuguese, Oct. 2003.
 
* R. Ribeiro, L. Oliveira, I. Trancoso. Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese. Proc. PROPOR'2003 Faro, Portugal, June 2003
 
* S. Paulo, L. Oliveira. Multilevel Annotation Of Speech Signals Using Weighted Finite State Transducers. Proc. 2002 IEEE Workshop on Speech Synthesis Santa Monica, California, September 2002
 
* D. M. de Matos, A. Mateus, J. Graca, and N. J. Mamede. Empowering the User: a Data-oriented Application-Building Framework. In Adj. Proc. of the 7th ERCIM Workshop "User Interfaces for All", pages 37-44, Chantilly, France, October 2002.
 
* M. C. Viana and Luis C. Oliveira and A. I. Mata. Prosodic Phrasing: Machine and Human Evaluation International Journal of Speech Technology, 6(1), pp. 83-94, Jan. 2003, Kluwer Academic Publishers.
 
* David M. de Matos, Ricardo Ribeiro, and Nuno Mamede. Rethinking Reusable Resources. Proc. of the International Conference on Language Resources and Evaluation, LREC'2004, ELRA. May 2004.
 
* Sergio Manuel Gaspar Ferreira Paulo and Luis C. Oliveira. DTW-based Phonetic Alignment Using Multiple Acoustic Features EUROSPEECH'2003 - 8th European Conference on Speech Communication and Technology, Sep. 2003
 
* David Matos and Joana Paulo and Nuno Mamede. Managing Linguistic Resources and Tools. Proc. of the 6th Intl. Workshop, PROPOR 2003, Jun. 2003 , pp. 135--142 , Springer-Verlag, Heidelberg.
 
 
<!--

== Repercussions ==

In our view, this project can lay the groundwork for a European research project that will extend the Portuguese language components of this project with components built for other languages.

-->
[[Image:barra_posc.gif]]
 
[[category:Research]]

[[category:Projects]]

[[category:National Projects]]

Latest revision as of 09:18, 20 June 2008

Summary

Computational grids enable the sharing and aggregation of geographically distributed resources for solving large-scale, data-intensive problems. In this project we propose the implementation of a software architecture for building component-based applications for the computational processing of the Portuguese language that can be executed on a computational grid.

The first task of the project is the definition of an architectural model for building grid-enabled component-based natural language engineering applications. The idea is to create a flexible application design environment, allowing disparate components to be combined to suit the overall application functionality.

Managing multi-component architectures is a challenging task and their complexity can be a serious hurdle when trying to bring together heterogeneous components. This problem occurs at various levels: from file-format handling or network-level communication to interaction between modules in a large application. The adopted model will be based on the work developed for the Galinha system (Matos, 2003) and on the architecture proposed by Hughes and Bird (2003).

The second task is the encapsulation of the re-usable components to comply with the requirements of the model defined in the previous task. Language engineering applications are constructed out of several processing components, each responsible for a specialized task. Typical components include a phone segmenter, a tagger, a lexicon access module and a parser. These components are usually highly parameterized and some of them have to be trained on very large datasets. Discovering optimal parameterizations is both data-intensive and computationally intensive. In this task a set of pre-existing components will be adapted to the proposed framework.
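
Because each candidate parameterization can be trained and scored independently, the search described above splits naturally into many separate jobs. The following is a minimal sketch of that idea in Python, not part of the project's code: the parameter names (`smoothing`, `context`), the toy `evaluate` function and the local thread pool are all assumptions standing in for a real component and for grid job submission.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def parameter_grid(space):
    """Yield one dict per combination of the candidate parameter values."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def evaluate(params):
    """Hypothetical stand-in for training and scoring one component.

    A real job would train on a large dataset; this toy objective
    simply peaks at smoothing=0.1 and context=2.
    """
    return -abs(params["smoothing"] - 0.1) - abs(params["context"] - 2)


def sweep(space, workers=4):
    """Score every parameterization as an independent job and keep the best."""
    jobs = list(parameter_grid(space))
    # The local thread pool is an assumption; on a grid each job would be
    # submitted to a remote resource instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate, jobs))
    best = max(range(len(jobs)), key=scores.__getitem__)
    return jobs[best], scores[best]


space = {"smoothing": [0.01, 0.1, 1.0], "context": [1, 2, 3]}
best_params, best_score = sweep(space)
```

The jobs share no state, which is what makes this kind of search a good fit for distributed execution.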

The third task deals with the description and design of a multi-component application. Given a set of wrapped components, we need to combine them to create a range of multi-component applications that can be executed over the distributed GRID infrastructure. A portal-based approach will be used for application assembly.

The fourth task is the development of the interface to GRID services. We will adopt the approach provided by Globus (Globus Project, 2003). The Globus model identifies a distributed set of resources (the GRID) that the applications can use. We will also use a resource broker to manage the GRID interaction. The resource broker is responsible for the matchmaking between job requests and the distributed resources provided through the gatekeepers.
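
The matchmaking role of the resource broker can be illustrated with a small, self-contained Python sketch. It does not use the Globus toolkit or any real broker API: the `Resource` and `JobRequest` records, their fields, the site names and the selection policy (prefer the eligible site with the most free CPUs) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of what one gatekeeper advertises."""
    name: str
    free_cpus: int
    memory_gb: int
    components: frozenset  # names of the components installed at the site


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical job request: which component to run and what it needs."""
    component: str
    cpus: int
    memory_gb: int


def match(job, resources):
    """Return the eligible resource with the most free CPUs, or None."""
    eligible = [
        r for r in resources
        if job.component in r.components
        and r.free_cpus >= job.cpus
        and r.memory_gb >= job.memory_gb
    ]
    if not eligible:
        return None
    # Assumed policy: most free CPUs first, ties broken by name.
    return min(eligible, key=lambda r: (-r.free_cpus, r.name))


resources = [
    Resource("site-a", free_cpus=2, memory_gb=16, components=frozenset({"tagger"})),
    Resource("site-b", free_cpus=8, memory_gb=32, components=frozenset({"tagger", "parser"})),
    Resource("site-c", free_cpus=16, memory_gb=64, components=frozenset({"parser"})),
]

chosen = match(JobRequest("tagger", cpus=4, memory_gb=8), resources)
```

Here `site-a` lacks the CPUs and `site-c` lacks the component, so the request is matched to `site-b`. A production broker would also account for queue state, data location and security policy.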

The final task is the evaluation of the proposed solution. A set of component-based applications will be selected to be used in experiments to evaluate the performance of the resulting system.

Objectives

The main objective of this project is to create a framework for high performance NLE computing on a computational GRID by extending the structure of the Galinha system (Matos, 2003).

The Galinha system was developed by the Spoken Language Laboratory (L²F) of INESC-ID in an effort to simplify the creation of NLE applications: an application is built through a web interface by creating a service chain from a pool of re-usable components. The current version includes components for morphological analysis, part-of-speech disambiguation and syntactic analysis, among others. In this project we also plan to include modules for speech processing tasks.
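
The idea of a service chain assembled from a pool of components can be sketched as follows. This is an illustrative Python sketch only, not Galinha's actual interface: the `Component` record, the data-type labels (`text`, `tokens`, `tagged`) and the toy tokenizer and tagger are assumptions. The only check performed is that each component produces the kind of data the next one consumes.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Component:
    """Hypothetical wrapped component, described by the data it handles."""
    name: str
    consumes: str
    produces: str
    run: Callable[[Any], Any]


def build_chain(components):
    """Check that adjacent components agree on data types, then compose them."""
    for left, right in zip(components, components[1:]):
        if left.produces != right.consumes:
            raise ValueError(
                f"{left.name} produces {left.produces!r}, "
                f"but {right.name} consumes {right.consumes!r}"
            )

    def chain(data):
        # Feed the output of each component into the next one.
        for component in components:
            data = component.run(data)
        return data

    return chain


# Toy components standing in for real NLE modules.
tokenizer = Component("tokenizer", "text", "tokens", str.split)
tagger = Component(
    "tagger", "tokens", "tagged",
    lambda tokens: [(t, "CAP" if t[:1].isupper() else "LOW") for t in tokens],
)

pipeline = build_chain([tokenizer, tagger])
result = pipeline("Galinha chains components")
```

With this design a user assembling a chain only needs to know what data each component consumes and produces, not how the component works internally.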

In this project we will extend the Galinha architecture to include an interface to GRID services so that the components and data can be geographically and organizationally distributed. This requires a set of standard middleware protocols like the ones provided by the Globus toolkit to handle security, information discovery, resource and data management.

Publications

International Conferences

  • Ricardo Daniel Ribeiro, David Martins de Matos, Extractive Summarization of Broadcast News: Comparing Strategies for European Portuguese, In Text, Speech and Dialogue, 10th International Conference, TSD 2007, Springer, vol. 4629, pages 115-122, September 2007
  • Fernando Batista, Nuno J. Mamede, Diamantino António Caseiro, Isabel Trancoso, A Lightweight on-the-fly Capitalization System for Automatic Speech Recognition, In Recent Advances in Natural Language Processing, vol. 1, September 2007

National Conferences

  • Ivo Anjo, David Martins de Matos, XMLaligner: Exploração de Corpora Paralelos, In XATA2007 — XML: Aplicações e Tecnologias Associadas, FCUL, February 2007
  • Bruno Oliveira, Carlos Pona, David Martins de Matos, Ricardo Daniel Ribeiro, Utilização de XML para Desenvolvimento Rápido de Analisadores Morfológicos Flexíveis, In XATA2006 - XML: Aplicações e Tecnologias Associadas, Universidade do Minho, February 2006

Technical Reports

  • David Martins de Matos, Tiago Luís, Ricardo Daniel Ribeiro, Natural Language Engineering on a Computational Grid (NLE-GRID) T1 - Architectural Model, January 2008
  • David Martins de Matos, Ricardo Daniel Ribeiro, Sérgio Paulo, Fernando Batista, Luísa Coheur, Joana Paulo Pardal, Natural Language Engineering on a Computational Grid (NLE-GRID) T2 - Encapsulation of Reusable Components, January 2008
  • David Martins de Matos, Ricardo Daniel Ribeiro, Natural Language Engineering on a Computational Grid (NLE-GRID) T2h - Encapsulation of Reusable Components: Lexicon Repository and Server, January 2008
  • Luis Marujo, Wang Lin, David Martins de Matos, Natural Language Engineering on a Computational Grid (NLE-GRID) T3 - Multi-Component Application Builder, January 2008
  • Tiago Luís, David Martins de Matos, Sérgio Paulo, Ricardo Daniel Ribeiro, Natural Language Engineering on a Computational Grid (NLE-GRID) T5 - Performance Experiments, January 2008
  • Ricardo Daniel Ribeiro, David Martins de Matos, Bruno Oliveira, Carlos Pona, Luísa Coheur, Creating and Maintaining Multi-purpose Lexical Knowledge, July 2006

Demos

Barra posc.gif