Controlling Complexity in Part-of-Speech Induction

João Graça

Date

14:00, May 28th, 2010
Room 4

Speaker

João Graça

Abstract

We consider the problem of fully unsupervised learning of part-of-speech tags from unlabeled text, without assuming a word-tag dictionary. The standard Hidden Markov Model (HMM) fit via Expectation Maximization (EM) performs quite poorly, due in large part to the weakness of its inductive bias and excessive model capacity.

We address these problems by reducing the model's capacity via parametric and non-parametric constraints: eliminating parameters for rare words, adding morphological and orthographic features, and enforcing word-tag association sparsity. We propose a simple model and an efficient learning algorithm, neither of which is much more complex than training with standard EM.
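
As a rough illustration of one of these constraints, eliminating parameters for rare words, the sketch below pools every word seen fewer than a threshold number of times into a single shared symbol, so the HMM needs only one set of emission parameters for all of them. This is an assumption-laden preprocessing sketch, not the speaker's implementation: the function name `pool_rare_words`, the `min_count` threshold, and the `<RARE>` symbol are hypothetical, and the actual model may handle rare words differently.

```python
from collections import Counter


def pool_rare_words(sentences, min_count=2, rare_symbol="<RARE>"):
    """Replace words seen fewer than min_count times with one shared symbol.

    Hypothetical helper: it shrinks the emission vocabulary, so an HMM with
    K tags needs K * |pooled vocabulary| emission parameters instead of
    K * |original vocabulary|.
    """
    # Count word occurrences over the whole unlabeled corpus.
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else rare_symbol for word in sentence]
        for sentence in sentences
    ]


# Toy unlabeled corpus (made-up data, for illustration only).
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]

pooled = pool_rare_words(corpus, min_count=2)
vocab_before = {word for sentence in corpus for word in sentence}
vocab_after = {word for sentence in pooled for word in sentence}
```

In this toy corpus, "barks" and "cat" each occur once and are pooled, so the emission vocabulary shrinks from five word types to four.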

Our experiments on six languages (Bulgarian, Danish, English, Portuguese, Spanish, Turkish) achieve dramatic improvements over state-of-the-art results: 11% average absolute increase in aligned tagging accuracy.

From HLT@INESC-ID