Speaker Diarization and Tracking in Smart-Room Environments and Broadcast News


Jordi Luque
Jordi Luque received the Electrical Engineering degree from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2005. He is currently working towards the Ph.D. degree at the Research Center for Language and Speech Technology and Applications (TALP) at the UPC. His research interests lie in the field of speech processing; specifically, he has worked on speaker identification and verification, diarization of meetings and broadcast news, and automatic speech recognition. His current work focuses on speaker diarization and tracking in smart-room environments, combining information from audio and video modalities.


  • 14:00, Friday, November 20th, 2009
  • Room 4


  • Jordi Luque, Research Center for Language and Speech Technology and Applications (TALP), UPC, Spain


Continuous speaker identification in real conditions is far from being a solved problem. In realistic situations, such as a typical multi-speaker environment with continuous interaction between speakers, it becomes genuinely hard. In a meeting or a conference, for example, where several people interact to exchange information on a common topic, the identification system faces many issues that degrade speaker identification: non-speech events such as footsteps, chairs moving, laughter, or key typing; speech overlap between speakers; reverberation effects; varying acoustic conditions across recordings; or simply the accurate detection of speaker turns, to name just a few from a long list.

Two main tasks related to the identification of people across time can be found in the literature. Speaker diarization generally answers the question "Who spoke when?" and is performed without any prior knowledge of the identities of the speakers in the audio stream or of how many there are. The output of diarization is a set of labels that identify regions of the recording produced by the same speaker, without regard to who that speaker is. The task of finding such speaker-homogeneous regions was first introduced by NIST in the "Who spoke when" evaluations of the Rich Transcription project. In contrast, speaker tracking attempts to put a name to each label, identifying the speakers from a set of known target speakers. Speaker identification and verification are implicit tasks of such systems: the identification stage compares the obtained segments against a set of target-speaker models and decides which speaker most likely produced the speech. This task of identifying the regions associated with a particular speaker was defined during the 1999 NIST Speaker Recognition evaluation.
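The distinction between the two tasks can be made concrete with a small sketch. All segment times, anonymous labels, speaker names, and scores below are made-up illustrative values, and the score table stands in for whatever matching a real system would perform against enrolled target-speaker models:

```python
# Diarization output answers "Who spoke when?" with anonymous labels:
# (start_time_s, end_time_s, anonymous_label). Values are illustrative.
diarization = [
    (0.0, 4.2, "spk1"),
    (4.2, 9.8, "spk2"),
    (9.8, 12.5, "spk1"),
]

# Tracking additionally maps each anonymous label to one of a set of
# known target speakers. Here, hypothetical match scores of each label
# against enrolled speaker models (higher = more likely match):
scores = {
    "spk1": {"Alice": 0.91, "Bob": 0.22},
    "spk2": {"Alice": 0.15, "Bob": 0.87},
}

# Assign each anonymous label the best-scoring target speaker.
label_to_name = {
    label: max(model_scores, key=model_scores.get)
    for label, model_scores in scores.items()
}

# Tracking output: the same segments, now carrying speaker identities.
tracking = [(start, end, label_to_name[label])
            for start, end, label in diarization]

for start, end, name in tracking:
    print(f"{start:5.1f}-{end:5.1f} s  {name}")
```

The point of the sketch is only the shape of the two outputs: diarization stops at consistent anonymous labels, while tracking adds the extra step of scoring those labels against known target speakers.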

Such information is of high interest for several speech and audio applications. Automatic speaker indexing of audio documents or simply producing more readable transcriptions are straightforward examples. Furthermore, the speaker identity can be combined, for example, with audio source localization to improve a speaker localization tracker; in automatic speech recognition, such labeling can support unsupervised speaker adaptation, improving the performance of large-vocabulary continuous speech recognition systems.