How to keep up with Language Dynamics?


Cristina Mota


  • 15:00, December 14, 2007
  • INESC-ID's Room 336 (Alves Redol)



Any information extraction system or component based on machine learning techniques, in particular a named entity tagger, needs data, labeled or unlabeled, for the training stage.

The data used to train and test a system are normally compatible in terms of some text parameter: not surprisingly, testing the system on a language, genre, or topic it was not trained for results in poor performance. But what if the time frame is different? Do texts change significantly over time in ways that affect system performance? Will the system thus become obsolete, i.e., will its performance decrease over time? To investigate this issue, we followed two complementary directions: one focusing on the analysis of the corpus, the other on the assessment of named entity tagger performance.

Our preliminary experiments with the corpus show that (i) the similarity between two texts decreases as the time gap between them increases, more markedly for some topics; (ii) in some cases, texts drift apart over time until they are as different as texts on two distinct topics; (iii) the name overlap between two texts also decays as the texts become temporally more distant, the decay being more evident for some name categories. Regarding system performance, we assessed an unsupervised named entity tagger based on co-training; the algorithm needs only a small set of seeds and a set of unlabeled examples to bootstrap.
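The two corpus measurements above — text similarity and name overlap across a time gap — can be illustrated with a minimal sketch. This is not the talk's actual methodology; the toy yearly slices and the choice of bag-of-words cosine similarity and Jaccard overlap are assumptions made for illustration only.

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def name_overlap(names_a, names_b):
    """Jaccard overlap between the sets of names occurring in two slices."""
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 0.0

# Toy yearly slices standing in for a diachronic corpus (illustrative only):
# each year maps to (text, set of names found in that slice).
slices = {
    1994: ("the parliament debated the budget in Lisboa", {"Lisboa"}),
    1996: ("the parliament discussed taxes in Lisboa and Porto", {"Lisboa", "Porto"}),
    1999: ("the council reviewed trade rules in Bruxelas", {"Bruxelas"}),
}

# Compare the earliest slice against increasingly distant ones.
years = sorted(slices)
base = years[0]
for year in years[1:]:
    sim = cosine_similarity(slices[base][0], slices[year][0])
    ovl = name_overlap(slices[base][1], slices[year][1])
    print(f"{base}->{year}: similarity={sim:.2f}, name overlap={ovl:.2f}")
```

On a real corpus one would plot these scores against the time gap; the abstract's finding is that both curves decay, with the rate depending on topic and name category.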

Our preliminary experiments with the tagger show that (i) its performance decreases as the time gap between the training data (seeds and unlabeled data) and the test corpus increases; (ii) using the same seeds as in the previous experiment, the decrease is attenuated if a fraction of the unlabeled data is replaced by data within the time frame of the test set; (iii) the latter performance is comparable to that of the tagger trained with data exclusively within the time frame of the test set. This last result suggests that we may not need new labeled data; new unlabeled data could suffice to prevent performance decay over time. Given these preliminary results, our main contributions are (i) providing empirical evidence that the time frame of a text affects named entity recognition performance; (ii) proposing a methodology to avoid this time-driven performance decay.
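The seed-and-unlabeled bootstrapping that the tagger relies on can be sketched very roughly as follows. This is a toy two-view scheme in the general spirit of co-training NER (name spelling vs. surrounding context labeling each other), not the actual algorithm used in the talk; the data structures and the single-context-word view are simplifying assumptions.

```python
from collections import Counter, defaultdict

def bootstrap_ner(seeds, occurrences, rounds=3):
    """Toy two-view bootstrapping sketch (an assumption, not the talk's
    algorithm): the spelling view (the name itself) and the context view
    (the word preceding the name) alternately label each other.
    `seeds` maps name -> category; `occurrences` is a list of
    (context_word, candidate_name) pairs from unlabeled text."""
    labeled = dict(seeds)      # name -> category, grown each round
    context_label = {}         # context_word -> category
    for _ in range(rounds):
        # Context view: each context inherits the majority category
        # of the already-labeled names it appears with.
        votes = defaultdict(Counter)
        for ctx, name in occurrences:
            if name in labeled:
                votes[ctx][labeled[name]] += 1
        for ctx, counts in votes.items():
            context_label[ctx] = counts.most_common(1)[0][0]
        # Spelling view: unknown names take the category of a labeled
        # context they occur in.
        for ctx, name in occurrences:
            if name not in labeled and ctx in context_label:
                labeled[name] = context_label[ctx]
    return labeled

# Starting from one seed, the "in" context propagates LOC to new names.
result = bootstrap_ner(
    {"Lisboa": "LOC"},
    [("in", "Lisboa"), ("in", "Porto"), ("in", "Bruxelas")],
)
```

Under this sketch, refreshing `occurrences` with text drawn from the test set's own time frame — as in experiment (ii) above — lets recent contexts and names enter the bootstrap without any new labeled data.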