Robust Speaker Diarization for Meetings


Xavier Anguera Miró

Xavier Anguera Miró (Ing. [MS] 2001, UPC University; Dr. [PhD] 2006, UPC University, with a thesis titled "Robust Speaker Diarization for Meetings").

From 2001 until 2004 he was with the Panasonic speech technology lab, working on speech synthesis and speaker verification. From September 2004 until September 2006 he was a visiting researcher at ICSI (International Computer Science Institute), where he focused his research on speaker diarization for meetings. He is currently with Telefónica I+D, pursuing research on speaker-related analysis and actively participating in Spanish and European projects. His research interests cover (but are not restricted to) the areas of speaker recognition and automatic indexing of acoustic data.



  • 15:00, Friday, June 1, 2007
  • 3rd floor meeting room



The goal of speaker diarization is to determine when each participant speaks in a recording. Such information is extensively used in ASR systems (for example, in VTLN or speaker adaptation) and in speaker indexing systems. Speaker diarization is also part of the ongoing Rich Transcription (RT) evaluations organized by NIST.

In recent years, the increasing interest in speech/video analysis for the meetings environment (NIST's RT05s and RT06s, AMI-DA-, CHIL and IM2 projects) has made it necessary to handle recordings made synchronously with several microphones. These can be either organized in microphone clusters or spread across the room in unknown locations.

This presentation will cover the basics of speaker diarization and the implementation proposed as part of the author's PhD. The system presented was built at the International Computer Science Institute (ICSI) for speaker diarization in the meeting environment and has been used for participation in the NIST RT evaluations since 2005. It is based on a single-channel diarization system originally created for broadcast news, with a preprocessing step based on the delay-and-sum algorithm that combines the multiple channels available into a single signal for processing.
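To give a rough idea of what a delay-and-sum preprocessing step does, here is a minimal Python sketch. It is not the system presented in the talk: the function names (estimate_delay, delay_and_sum), the choice of the first channel as reference, the plain cross-correlation used to estimate delays, the integer-sample delays and the uniform averaging of channels are all assumptions made for illustration only.

```python
def estimate_delay(reference, channel, max_lag):
    """Return the integer lag (in samples) that maximizes the plain
    cross-correlation between `reference` and `channel`.

    A positive lag means `channel` is delayed relative to `reference`.
    Plain cross-correlation is an illustrative assumption here.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(reference)):
            j = i + lag
            if 0 <= j < len(channel):
                score += reference[i] * channel[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, max_lag=10):
    """Align every channel to the first one and average them.

    Returns the combined signal and the estimated delays.
    Using channel 0 as reference and uniform weights is an assumption.
    """
    reference = channels[0]
    # The reference is aligned with itself by definition.
    delays = [0] + [estimate_delay(reference, ch, max_lag) for ch in channels[1:]]

    output = []
    for i in range(len(reference)):
        total, count = 0.0, 0
        for ch, d in zip(channels, delays):
            j = i + d  # sample of this channel that lines up with reference[i]
            if 0 <= j < len(ch):
                total += ch[j]
                count += 1
        output.append(total / count)  # count >= 1: the reference always contributes
    return output, delays


if __name__ == "__main__":
    # Toy example: the second "microphone" hears the same pulse 2 samples later.
    mic_a = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0]
    mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
    combined, delays = delay_and_sum([mic_a, mic_b], max_lag=4)
    print(delays)
    print(combined)
```

The single combined signal is what a single-channel diarization system would then take as input. With real recordings, the delays would normally be re-estimated over short windows, since the active speaker (and therefore the delay to each microphone) changes over time.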

The latter part of the talk will introduce the work recently started in the speaker and audio indexing area at Telefónica R&D. This work has been driven mainly by the Spanish I3media Cenit project, which started this year. The project's objectives will be described, as well as the lines of research pursued so far.