We consider the problem of fully unsupervised learning of part-of-speech tags from unlabeled text, without assuming a word-tag dictionary. The standard Hidden Markov Model (HMM) fit via Expectation Maximization (EM) performs quite poorly, due in large part to the weakness of its inductive bias and excessive model capacity.
We address these problems by reducing the HMM's capacity via parametric and non-parametric constraints: eliminating parameters for rare words, adding morphological and orthographic features, and enforcing word-tag association sparsity. We propose a simple model and an efficient learning algorithm that together are not much more complex than a standard HMM trained with EM.
In experiments on six languages (Bulgarian, Danish, English, Portuguese, Spanish, Turkish), our approach achieves dramatic improvements over state-of-the-art results: an 11% average absolute increase in aligned tagging accuracy.
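
Because the induced tags are unnamed clusters, accuracy can only be computed after aligning them with gold tags. The abstract does not specify the alignment, so the sketch below assumes one common choice, a many-to-one mapping in which each induced cluster is assigned its most frequent gold tag. The function name `many_to_one_accuracy` and the toy data are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(predicted, gold):
    """Score induced clusters against gold tags under a many-to-one mapping.

    Assumption: each induced cluster is mapped to the gold tag it co-occurs
    with most often; several clusters may map to the same gold tag.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0

    # Count how often each induced cluster co-occurs with each gold tag.
    cooccurrence = defaultdict(Counter)
    for cluster, tag in zip(predicted, gold):
        cooccurrence[cluster][tag] += 1

    # A cluster is credited with the tokens of its best-matching gold tag.
    correct = sum(counts.most_common(1)[0][1] for counts in cooccurrence.values())
    return correct / len(gold)


if __name__ == "__main__":
    # Toy example: three induced clusters over six tokens.
    predicted = [0, 0, 1, 1, 1, 2]
    gold = ["DT", "DT", "NN", "NN", "VB", "VB"]
    print(many_to_one_accuracy(predicted, gold))
```

A one-to-one alignment, in which each gold tag may be claimed by at most one cluster, is a stricter alternative and would generally yield lower scores on the same output.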