Improving Methods for Single-Label Text Categorization

From HLT@INESC-ID

Ana Cardoso Cachopo
Ana Cardoso Cachopo was born in Lisbon, Portugal in 1971.

She graduated in Software and Computer Engineering from IST - Instituto Superior Técnico in 1994. She received her MSc in Electrical and Computer Engineering, also from IST, in 1997, with a thesis on "Permissive Belief Revision", and her PhD in Computer Science, also from IST, in 2007, with a thesis on "Improving Methods for Single-label Text Categorization". She does her research at GIA - Grupo de Inteligência Artificial and, as an invited researcher, at INESC-ID's ALGOS group. She started teaching at IST in 1992 and is now a Teaching/Research Assistant. She belongs to the Artificial Intelligence Group (GIA) of the Department of Information Systems and Computer Science (DEI - Departamento de Engenharia Informática).


Date

  • 14:00, Friday, October 12, 2007
  • 3rd floor meeting room, INESC-ID

Speaker

  • Ana Cardoso Cachopo, DEI, GIA, Algos - INESC-ID.

Abstract

As the volume of information in digital form increases, the use of Text Categorization techniques aimed at finding relevant information becomes more necessary.

To improve the quality of the classification, I propose the combination of different classification methods. The results show that KNNLSI, the combination of KNN (k-Nearest Neighbors) with LSI (Latent Semantic Indexing), achieves an average Accuracy over the five datasets that is higher than the average Accuracy of each original method. The results also show that SVMLSI, the combination of SVM (Support Vector Machines) with LSI, outperforms both original methods on some datasets. Given that SVM is usually the best performing method, it is particularly interesting that SVMLSI performs even better in some situations.
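The abstract does not spell out how KNN and LSI are combined. One plausible reading, assumed here purely for illustration, is to project the documents into a low-rank latent space with a truncated SVD and then run a nearest-neighbor vote in that space. The function name `knn_lsi_predict`, its parameters, and the toy data are all hypothetical and are not taken from the dissertation.

```python
from collections import Counter

import numpy as np


def knn_lsi_predict(term_doc, labels, query, rank=2, k=3):
    """Classify `query` by a k-NN vote in an LSI (truncated SVD) space.

    term_doc: (n_terms, n_docs) term-by-document count matrix.
    labels:   one class label per training document (column).
    query:    (n_terms,) term vector of the document to classify.
    This is an illustrative sketch, not the dissertation's implementation.
    """
    # LSI step: keep the `rank` leading left singular vectors.
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :rank]

    # Project training documents and the query into the latent space.
    docs = term_doc.T @ U_k          # shape (n_docs, rank)
    q = query @ U_k                  # shape (rank,)

    # KNN step: cosine similarity, then a majority vote among the top k.
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (docs @ q) / norms
    nearest = np.argsort(-sims)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy corpus. Terms: ball, goal, team, vote, election, party.
docs = np.array([
    [2, 1, 1, 0, 0, 0],   # sports
    [1, 2, 0, 0, 0, 0],   # sports
    [1, 0, 2, 0, 0, 0],   # sports
    [0, 0, 0, 3, 1, 1],   # politics
    [0, 0, 0, 1, 2, 0],   # politics
    [0, 0, 0, 1, 0, 2],   # politics
], dtype=float)
labels = ["sports"] * 3 + ["politics"] * 3

query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
print(knn_lsi_predict(docs.T, labels, query))
```

An SVM could be trained on the same projected vectors instead of voting among neighbors, which would be one way to read the SVMLSI combination under the same assumption.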

To reduce the number of labeled documents needed to train the classifier, I propose the use of a semi-supervised centroid-based method that uses information from small volumes of labeled data together with information from larger volumes of unlabeled data for text categorization. Using one synthetic dataset and three real-world datasets, I provide empirical evidence that, if the initial classifier for the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data actually degrades the results if the initial classifier is not good enough.
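A minimal sketch of a semi-supervised centroid-based classifier follows. It assumes a simple self-training scheme: class centroids are built from the labeled documents, then each unlabeled document is assigned to its most similar centroid and folded into it. The abstract does not state the actual update rule, so this scheme, the function names, and the `min_sim` threshold are assumptions made for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity, with a small constant to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def semi_supervised_centroids(X_lab, y_lab, X_unlab, min_sim=0.0):
    """Build class centroids from labeled data, then refine with unlabeled data.

    Hypothetical self-training scheme, not the dissertation's exact method.
    """
    classes = sorted(set(y_lab))
    # Per-class sums of document vectors; cosine is scale-invariant, so a
    # sum behaves like the mean centroid when comparing directions.
    sums = {c: np.zeros(X_lab.shape[1]) for c in classes}
    for x, y in zip(X_lab, y_lab):
        sums[y] = sums[y] + x

    # Fold each unlabeled document into the class it currently looks like.
    for x in X_unlab:
        best = max(classes, key=lambda c: cosine(x, sums[c]))
        if cosine(x, sums[best]) >= min_sim:
            sums[best] = sums[best] + x

    # Return unit-length centroids.
    return {c: s / (np.linalg.norm(s) + 1e-12) for c, s in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is most similar to `x`."""
    return max(centroids, key=lambda c: cosine(x, centroids[c]))


# Toy data. Terms: ball, goal, vote, party.
X_lab = np.array([[1, 0, 0, 0], [0, 0, 1, 0]], dtype=float)
y_lab = ["sports", "politics"]
X_unlab = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)

centroids = semi_supervised_centroids(X_lab, y_lab, X_unlab)
# "goal" never occurs in the labeled data; the unlabeled document that
# pairs it with "ball" is what links it to the sports centroid.
print(predict(centroids, np.array([0, 1, 0, 0], dtype=float)))
```

The sketch also hints at why a weak initial classifier can hurt: every unlabeled document assigned to the wrong class pulls that centroid away from its true direction, and later assignments are made against the shifted centroid.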

The dissertation includes a comprehensive comparison between the classification methods that are most frequently used in the Text Categorization area and the combinations of methods proposed.