Difference between revisions of "Language and Dialect Identification"

From HLT@INESC-ID

 

Revision as of 13:03, 29 June 2006

Spoken language identification was the topic of an MSc thesis [Caseiro, 98], which explored phonotactic features. The approach was based on an architecture similar to double bigram decoding [Navratil, 99]. The system was trained and tested on the SPEECHDAT-M database: 9 phonetically rich sentences from 1000 speakers in 6 European languages (English, French, German, Italian, Portuguese and Spanish). The confusions between languages agreed with the known proximity between language pairs. The overall identification rate was 83.4%, and performance increased, as expected, with the duration of the sentences. The system is easily extended to other languages, since it requires only audio data in those languages.
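As a rough illustration of what phonotactic identification means, the sketch below scores a phone sequence against one phone-bigram model per language and picks the best-scoring language. It is a generic, minimal example and not the system described above: it does not implement double bigram decoding, the phone strings are made up, the names `train_bigram_model`, `score` and `identify` are hypothetical, and a real system would obtain its phone sequences from a phone recognizer run on audio.

```python
import math
from collections import Counter


def train_bigram_model(sequences, vocab):
    """Return a function giving add-one smoothed log P(cur | prev)."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the utterance start
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    size = len(vocab)

    def log_prob(prev, cur):
        # Add-one smoothing keeps unseen bigrams from scoring log(0).
        return math.log((pair_counts[(prev, cur)] + 1) / (context_counts[prev] + size))

    return log_prob


def score(log_prob, seq):
    """Sum of bigram log-probabilities of a phone sequence under one model."""
    padded = ["<s>"] + list(seq)
    return sum(log_prob(prev, cur) for prev, cur in zip(padded, padded[1:]))


def identify(models, seq):
    """Pick the language whose bigram model gives the highest score."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Toy, invented phone sequences (assumption: not taken from SPEECHDAT-M).
training = {
    "lang_a": [["s", "t", "a"], ["s", "t", "o"], ["t", "a", "s"]],
    "lang_b": [["a", "o", "a"], ["o", "a", "o"], ["a", "a", "o"]],
}
vocab = sorted({p for seqs in training.values() for s in seqs for p in s})
models = {lang: train_bigram_model(seqs, vocab) for lang, seqs in training.items()}

print(identify(models, ["s", "t", "a", "s"]))
```

In a scheme like this, adding a language only requires phone sequences for that language, and a longer utterance contributes more bigrams to each score, which is one plausible reading of why identification improves with sentence duration.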

The identification of varieties of Portuguese is a current research topic.