500M-KPCrowd is a corpus of 500 news articles (50 stories in each of 10 selected categories), manually annotated with Key Phrases by 20 Amazon Mechanical Turk workers.

The news articles were retrieved from online news sources.


  • Number of stories: 450 / 50 (Train / Test)
  • Average number of Amazon Mechanical Turk workers per news story: 20
  • Number of Topics: 10
  • Average Number of Key Phrases per news story: 40

Further Reading

The corpus is free for non-commercial use.

Please cite the following paper in any publication that uses this data:

Luis Marujo, Anatole Gershman, Jaime Carbonell, Robert Frederking, and João Paulo da Silva Neto. Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. 8th International Conference on Language Resources and Evaluation (LREC 2012), May 2012, ELRA.