Creating the "Touching the void" effect in a heterogeneous corpus.

From HLT@INESC-ID

Date

  • 15:00, Friday, July 9th, 2010
  • Room 336

Speaker

  • Paula Vaz Lobo

Abstract

In this paper we present a content- and similarity-based method to search for documents that are forgotten in the long tail, thus creating the "Touching the void" effect.

For this experiment, the long tail contains a heterogeneous set of fairy tales, scientific papers, and short news items. The method extracts stylometric and semantic features from the documents and builds a vector space for each type of feature or combination of features. Vectors are compared using the cosine measure, and the top-n most similar documents for each document are saved in a similarity matrix. The method then looks at the user profile, merges the lists of documents similar to the documents in that profile, and recommends the top-n most similar documents from the merged list.
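The pipeline described above (cosine comparison, a top-n neighbour list per document, and a merged list for the user profile) can be illustrated with a minimal Python sketch. It is not the implementation used in the work: the function names, the toy vectors, and in particular the merging rule (keep the highest score for each candidate and skip documents already in the profile) are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For every document, keep its n most similar other documents."""
    similar = {}
    for doc_id, vec in vectors.items():
        scores = [(other, cosine(vec, other_vec))
                  for other, other_vec in vectors.items()
                  if other != doc_id]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        similar[doc_id] = scores[:n]
    return similar


def recommend(profile, similar, n):
    """Merge the neighbour lists of the profile documents; return the top n.

    Merging rule (assumed): a candidate keeps its best score across lists,
    and documents already in the profile are not recommended again.
    """
    best = {}
    for doc_id in profile:
        for other, score in similar.get(doc_id, []):
            if other in profile:
                continue
            best[other] = max(score, best.get(other, 0.0))
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]


# Toy feature vectors (hypothetical data, three features per document).
vectors = {
    "tale_a": [1.0, 0.0, 0.0],
    "tale_b": [0.9, 0.1, 0.0],
    "paper_a": [0.0, 1.0, 0.0],
    "paper_b": [0.0, 0.9, 0.1],
    "news_a": [0.0, 0.0, 1.0],
}
similar = top_n_similar(vectors, n=2)
recommendations = recommend({"tale_a"}, similar, n=1)
```

With this toy data, a profile containing only tale_a leads to tale_b being recommended, since it is the nearest neighbour under the cosine measure. Other merging rules, such as summing scores across lists, would be equally plausible readings of the description.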

The best results were achieved with word length frequencies (a stylometric feature), giving a precision of 0.95 and a recall of 0.92, and with vectors of lemmas and 4-gram concepts (semantic features from LSA, k = 100), giving a precision of 0.95 and a recall of 0.95.
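As an illustration of the stylometric feature named above, a word length frequency vector might be computed along the following lines. This is a sketch under stated assumptions: the tokenisation (runs of letters), the cap on word length (max_length), and the normalisation to relative frequencies are illustrative choices and are not taken from the work itself.

```python
import re
from collections import Counter


def word_length_frequencies(text, max_length=10):
    """Relative frequency of word lengths 1..max_length in a text.

    Words longer than max_length are counted in the last bin (assumption).
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    counts = Counter(min(len(word), max_length) for word in words)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_length
    return [counts.get(length, 0) / total
            for length in range(1, max_length + 1)]


# Lengths are 1, 2, 2, 3, so the vector is [0.25, 0.5, 0.25].
example = word_length_frequencies("a bb cc ddd", max_length=3)
```

Vectors built this way have the same dimensionality for every document, so they can be compared directly with the cosine measure.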