Fairy tale corpus

From HLT@INESC-ID

A corpus of fairy tales, semantically organized and tagged.

About the Corpus

This fairy tale corpus is divided into semantically related clusters. Clusters overlap, i.e., each tale can be allocated to more than one cluster.
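As a minimal sketch of what overlapping clusters imply for anyone working with the data, the membership can be held as a mapping from cluster to tales and inverted to find tales that belong to several clusters. The cluster and tale identifiers below are hypothetical and are not taken from the corpus, whose actual file layout is not described here.

```python
from collections import defaultdict

# Hypothetical cluster assignments; these identifiers are illustrative only
# and do not come from the corpus itself.
clusters = {
    "cluster_a": {"tale_1", "tale_2"},
    "cluster_b": {"tale_2", "tale_3"},
}


def tales_to_clusters(clusters):
    """Invert a cluster -> tales mapping into a tale -> clusters mapping."""
    membership = defaultdict(set)
    for cluster_id, tales in clusters.items():
        for tale in tales:
            membership[tale].add(cluster_id)
    return dict(membership)


membership = tales_to_clusters(clusters)

# Tales allocated to more than one cluster, i.e. where clusters overlap.
overlapping = sorted(t for t, c in membership.items() if len(c) > 1)
```

Using sets on both sides keeps membership tests cheap and makes the overlap explicit: a tale appears under every cluster it is allocated to.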

Fairy tales are written for children, and their plots and language are simpler than those of tales written for adults. They are easy to read and understand: sentences are short and emotions are well defined. A fairy tale corpus can be useful for tasks such as emotion extraction, semantic role extraction, meaning extraction, recommendation, and text classification.

  • Number of stories: 453
  • Number of words: 908,174
  • Average words/story: 1,891
  • Shortest story: 75 words
  • Longest story: 17,694 words
  • Clusters: 365
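Figures like those above can be recomputed once the stories are loaded. The sketch below assumes the stories are available as a mapping from story name to text and counts words by splitting on whitespace; both the function name `corpus_stats` and the tokenization rule are assumptions, and the corpus authors may have counted words differently.

```python
def corpus_stats(stories):
    """Return basic size statistics for a mapping of story name -> text.

    Words are counted by whitespace splitting, which is an assumption;
    a different tokenizer would give different totals.
    """
    lengths = [len(text.split()) for text in stories.values()]
    total = sum(lengths)
    return {
        "stories": len(lengths),
        "words": total,
        "average": total / len(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
    }


# Tiny made-up sample, not taken from the corpus.
sample = {"a": "Once upon a time", "b": "The end"}
stats = corpus_stats(sample)
```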

Using the Corpus

The corpus is free for non-commercial use. Please contact Paula Cristina Vaz for other uses.

If you use the corpus, please cite the following article:

  • Paula Vaz Lobo and David Martins de Matos. Fairy Tale Corpus Organization Using Latent Semantic Mapping and an Item-to-item Top-n Recommendation Algorithm. In Language Resources and Evaluation Conference (LREC 2010), European Language Resources Association (ELRA), Malta, May 2010.

Downloads

Download the corpus: fairy-tales-corpus-map.tar.gz
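Once downloaded, the archive can be unpacked with any tar tool. As one possible approach, the sketch below uses Python's standard `tarfile` module; the helper name `extract_corpus` and the target directory are hypothetical, and nothing is assumed about the files inside the archive.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, target_dir):
    """Unpack a gzip-compressed tar archive and return the extracted file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    # List every regular file that now exists under the target directory.
    return sorted(p for p in target.rglob("*") if p.is_file())
```

For example, `extract_corpus("fairy-tales-corpus-map.tar.gz", "fairy-tales")` would unpack the download into a `fairy-tales` directory (a name chosen here only for illustration) and return the list of extracted files.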