How to keep up with language dynamics? A case study on Named Entity Recognition


Cristina Mota


April 07, 2006



Over the last few years, NLP researchers have put considerable effort into building large annotated corpora (e.g. BNC, Brown Corpus). Depending on the target task and on the language, domain and genre of the texts, compatible corpora are used both to train machine learning systems and to guide the developers of hand-written rules.

What about the text time frame? Should the texts used to develop a system also be temporally close to the texts being analyzed? The main goal of our dissertation is to study the influence that the text time frame has on the performance of information extraction (IE) systems, in particular on Named Entity Recognition.

It is in some sense obvious, although as far as we know it has not yet been measured, that the names appearing in texts vary over time, as do the entities they refer to. Do the contexts in which they occur also vary? Are these variations significant enough to affect the performance of a system built from texts whose time frame differs from that of the texts on which it will perform information extraction?

We will show the first steps taken towards clarifying these issues by presenting some preliminary experiments on the political section of the Portuguese journalistic corpus CETEMPublico (this 180 million word corpus comprises 7 years of news articles from 1991 to 1998).

Given that this corpus has not yet been annotated with proper names, we will also describe the methodology we followed to obtain such an annotation by semi-automated means.