Language and Dialect Identification

Spoken language identification was the topic of an MSc thesis [Caseiro, 98] that explored phonotactic features. The approach was based on an architecture similar to double bigram decoding [Navratil 97]. The system was trained and tested on the SPEECHDAT-M database: 9 phonetically rich sentences from 1000 speakers in 6 European languages (English, French, German, Italian, Portuguese and Spanish). The confusion patterns were consistent with the linguistic proximity between language pairs. The overall identification rate was 83.4%, and performance increased, as expected, with sentence duration. The system is easily extended to other languages, since it only requires audio data in those languages.
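To make the phonotactic idea concrete, here is a minimal sketch in Python. It is not the system from the thesis and does not reproduce double bigram decoding. It only illustrates the general principle of scoring a decoded phone sequence against one phone-bigram model per language and picking the best-scoring language. The function names, the add-alpha smoothing, the per-phone normalisation and the toy phone strings are all assumptions made for illustration; a real system would obtain the phone sequences from a phone recogniser.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol


def train_bigram(sequences, vocab, alpha=1.0):
    """Return a function giving add-alpha smoothed bigram log-probabilities."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        num = pair_counts[(prev, cur)] + alpha
        den = context_counts[prev] + alpha * vocab_size
        return math.log(num / den)

    return logprob


def score(seq, logprob):
    """Average bigram log-likelihood per phone (assumed normalisation)."""
    padded = [START] + list(seq)
    total = sum(logprob(p, c) for p, c in zip(padded, padded[1:]))
    return total / max(len(seq), 1)


def identify(seq, models):
    """Pick the language whose bigram model scores the sequence highest."""
    return max(models, key=lambda lang: score(seq, models[lang]))


# Toy data, invented for illustration only (not SPEECHDAT-M).
toy_train = {
    "lang_a": [["a", "b", "a", "b"], ["a", "b", "a"]],
    "lang_b": [["b", "b", "a", "a"], ["b", "b", "b", "a"]],
}
vocab = {"a", "b"}
models = {lang: train_bigram(seqs, vocab) for lang, seqs in toy_train.items()}

print(identify(["a", "b", "a", "b"], models))
print(identify(["b", "b", "b", "a"], models))
```

In a sketch like this, adding a language only requires training one more bigram model on phone sequences decoded from audio in that language, which is in line with the extensibility noted above.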

The identification of varieties of Portuguese is a current research topic.

From HLT@INESC-ID