Language and Dialect Identification

From HLT@INESC-ID

Revision as of 13:03, 29 June 2006 by Imt

Spoken language identification was the topic of an MSc thesis [Caseiro, 98] that explored phonotactic features. Our approach was based on an architecture similar to double bigram decoding [Navratil, 99]. The system was trained and tested on the SPEECHDAT-M database: 9 phonetically rich sentences from each of 1000 speakers in 6 European languages (English, French, German, Italian, Portuguese and Spanish). The confusions between languages agreed with the known proximity between language pairs. The overall identification rate was 83.4%, and, as expected, performance increased with the duration of the sentences. The system is easily extended to other languages, since it requires only audio data in those languages.
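To illustrate the general idea behind phonotactic identification, the following is a minimal sketch, not the system described above. It assumes the audio has already been decoded into phone sequences, trains one add-one-smoothed phone bigram model per language, and picks the language whose model gives the highest log-likelihood. It does not implement double bigram decoding. The function names, the language labels "A" and "B", and the toy phone strings are all hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab):
    """Return a function giving add-one-smoothed bigram log-probabilities."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    # Possible successors of any symbol: every phone plus the end marker.
    size = len(vocab) + 1

    def logprob(prev, cur):
        return math.log(
            (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + size)
        )

    return logprob


def score(logprob, seq):
    """Log-likelihood of one phone sequence under one language model."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    return sum(logprob(prev, cur) for prev, cur in zip(padded, padded[1:]))


def identify(models, seq):
    """Pick the language whose bigram model scores the sequence highest."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Toy training data (hypothetical): each character stands for one phone.
training = {
    "A": ["pata", "tapa", "pato"],
    "B": ["strik", "krist", "stirk"],
}
vocab = {phone for seqs in training.values() for seq in seqs for phone in seq}
models = {lang: train_bigram(seqs, vocab) for lang, seqs in training.items()}

print(identify(models, "tapa"))
print(identify(models, "stri"))
```

Because the score is a sum of per-bigram log-probabilities, a longer utterance contributes more terms and therefore more evidence, which is consistent with the observation that identification improves with sentence duration.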

The identification of varieties of Portuguese is a current research topic.