Language and Dialect Identification

From HLT@INESC-ID


Latest revision as of 13:04, 29 June 2006

Spoken language identification was the topic of an MSc thesis [Caseiro, 98], which explored phonotactic features. Our approach was based on an architecture similar to double bigram decoding [Navratil 97]. The system was trained and tested on the SPEECHDAT-M database: 9 phonetically rich sentences from 1000 speakers in 6 European languages (English, French, German, Italian, Portuguese and Spanish). The confusions between languages were in agreement with the linguistic proximity between language pairs. The overall identification rate was 83.4%, and performance increased, as expected, with the duration of the sentences. The system is easily extendable to other languages, since it requires only audio data in those languages.
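
To illustrate the general idea behind phonotactic identification, the following is a minimal sketch, not the system described above: it trains one plain phone-bigram model per language and picks the language whose model gives a phone sequence the highest average log-likelihood. It does not implement double bigram decoding, and it assumes a phone recognizer has already turned the audio into phone labels. The function names (`train_bigram`, `identify`), the language labels and the two-symbol toy phone sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-utterance marker


def train_bigram(sequences, vocab, alpha=1.0):
    """Return a function giving add-alpha smoothed log P(cur | prev)."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        num = pair_counts[(prev, cur)] + alpha
        den = context_counts[prev] + alpha * vocab_size
        return math.log(num / den)

    return logprob


def identify(phones, models):
    """Score a phone sequence under every language model; return the best."""
    padded = [START] + list(phones)
    scores = {}
    for lang, logprob in models.items():
        total = sum(logprob(p, c) for p, c in zip(padded, padded[1:]))
        # Normalize by length so utterances of different duration compare.
        scores[lang] = total / max(len(phones), 1)
    best = max(scores, key=scores.get)
    return best, scores


# Toy training data (invented): phone label sequences per language.
training = {
    "lang_a": [["a", "b", "a", "b"], ["a", "b", "a"]],
    "lang_b": [["b", "b", "a", "a"], ["a", "a", "b", "b"]],
}
vocab = {"a", "b"}
models = {lang: train_bigram(seqs, vocab) for lang, seqs in training.items()}

best_lang, lang_scores = identify(["a", "b", "a", "b"], models)
print(best_lang, lang_scores)
```

In a sketch like this, adding a language only requires phone sequences decoded from audio in that language, which is consistent with the extensibility remark above. Longer sequences contribute more bigrams to the score, which is one plausible reason identification improves with sentence duration.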

The identification of varieties of Portuguese is a current research topic.