Feature extraction for content-based recommendation - Mining the long tail

From HLT@INESC-ID

Paula Vaz Lobo

Date

  • 14:30, Wednesday, March 9th, 2010
  • Room 336

Speaker

Abstract

The large amount of available items for consumption surpasses our processing capabilities. New content (books, news, music, video, etc.) is published every day, highly exceeding our capacity to make informed choices. The items that we do not know become potentially useless, because we are not aware of its existence and cannot specifically search for them.

Current recommendation systems try to predict what we want to consume. Nevertheless, quite often tend to recommend popular items, because they are mostly based on ratings. This phenomenon shapes the consumer curve as a Pareto's distribution placing popular rated items in the ``head (the first 20% of the total items) and the unpopular unrated items in the ``long tail (the rest 80%). Items in the long tail have a recognized interest for smaller groups of people. However, current recommendation systems are failing to reveal the unpopular items, because of the rating scarcity. There is a need to assist people finding interesting unrated items in the long tail.

In this thesis we explore textual features of documents in long tail. We explore document content to find similar documents using a top-N recommendation algorithm. We use semantic similarity (documents about the same subjects) as well as stylometric similarity (documents with similar types of writing style) to find documents that are closer to user preferences. Document similarity is measured using documents semantic and stylometric features. The combination of these two features type can improve recommendations novelty and help people find interesting documents in the long tail.