Language identification on broadcast news (seminar)



  • 14:30, January 19, 2007
  • 3rd floor meeting room (presentation in English)



At L2F, considerable research has been carried out on Automatic Speech Recognition (ASR) for Portuguese. The ASR system is currently used to transcribe the Telejornal, the evening news broadcast on the national public channel RTP1. The system now runs daily, and the transcription of the most recent evening news broadcast is available at

However, one of the problems the ASR system encounters is the presence of other languages: many interviews are subtitled in Portuguese while the audio remains in the original language, which causes recognition errors. To reduce the error rate, the system therefore needs to know whether the spoken language is really Portuguese or another language. Furthermore, if ASR systems are available for the most frequent other languages (such as English), we will be able not only to identify the spoken language but also to select the appropriate ASR system.

In this presentation, I will explain the work I have done at L2F on language identification. After a brief review of automatic language identification systems, I will describe the approaches I implemented during my stay at L2F: the first is based on prosodic features, the second uses the acoustic properties of languages, and the third is based on phonotactics. Finally, I will describe a basic fusion of the three approaches to give an idea of how much can be gained by using all of this information together.