Recent advances in language and speaker recognition (II): Compensation methods, the Joint Factor Analysis

From HLT@INESC-ID

Jordi Luque
Jordi Luque received the Electrical Engineering degree from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2005. He is currently working towards the PhD degree at the Research Center for Language and Speech Technology and Applications (TALP) at the UPC. His research interests lie in speech processing. Specifically, he has worked on speaker identification and verification, diarization of meetings and broadcast news, and automatic speech recognition. His current work focuses on speaker diarization and tracking in smart-room environments, combining information from other audio and video modalities. He is currently working at the Spoken Language Systems Laboratory (L2F).

Date

  • 15:00, Friday, January 29th, 2010
  • Room 336

Speaker

  • Jordi Luque, L2F and Research Center for Language and Speech Technology and Applications (TALP), UPC, Spain

Abstract

A considerable number of promising methods for language and speaker recognition have been proposed in the most recent NIST language (LRE) and speaker (SRE) recognition evaluation workshops. In this talk we will focus on the problem of compensating for several sources of variability, such as speaker or session, and we will introduce Joint Factor Analysis (JFA) modeling. We will explain the JFA model and give a brief account of the algorithms needed to carry out a JFA of speaker and session variability on a training set in which each speaker is recorded over many different channels.

JFA is a model of speaker and session variability in Gaussian mixture models (GMMs), and it is capable of performing at least as well as fusions of multiple systems of other types. The JFA technique uses the supervector form for modeling: it assumes that a speaker- and channel-dependent supervector (M) can be decomposed into a sum of two statistically independent supervectors, a speaker supervector (s) and a channel supervector (c), so that M = s + c. In addition, JFA assumes that all speaker-dependent supervectors are contained in the affine space defined by the eigenvoices, the directions of speaker variability, which generate the "speaker space". Similarly, the channel variability is confined to the "channel space" defined by the eigenchannels.
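
The decomposition described above can be illustrated with a small NumPy sketch. It is only a toy illustration under stated assumptions, not the implementation presented in the talk: the dimensions, the random matrices, the offset m of the affine speaker space, and the factor names y and x are all hypothetical choices made for this example. It builds two recordings of the same speaker over different channels and shows that they share the speaker supervector and differ only within the channel space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (assumptions, not values from the talk).
n_gaussians = 4                    # components in the GMM
feat_dim = 3                       # acoustic feature dimension
sv_dim = n_gaussians * feat_dim    # supervector = stacked GMM mean vectors
n_eigenvoices = 2
n_eigenchannels = 2

# Assumed model parameters, drawn at random purely for illustration.
m = rng.normal(size=sv_dim)                      # offset of the affine speaker space
V = rng.normal(size=(sv_dim, n_eigenvoices))     # columns: eigenvoices
U = rng.normal(size=(sv_dim, n_eigenchannels))   # columns: eigenchannels


def speaker_supervector(y):
    """Speaker supervector s, a point in the affine space m + span(V)."""
    return m + V @ y


def channel_supervector(x):
    """Channel supervector c, a point in the channel space span(U)."""
    return U @ x


# One speaker (factors y) recorded in two sessions (factors x1, x2).
y = rng.normal(size=n_eigenvoices)
x1 = rng.normal(size=n_eigenchannels)
x2 = rng.normal(size=n_eigenchannels)

s = speaker_supervector(y)
M1 = s + channel_supervector(x1)   # M = s + c for session 1
M2 = s + channel_supervector(x2)   # M = s + c for session 2

# The speaker part cancels, so the two sessions differ only in channel space.
session_diff = M1 - M2
print(np.allclose(session_diff, U @ (x1 - x2)))
```

In this sketch the speaker factors y stay fixed across sessions while the channel factors change from one recording to the next, which is the property that session compensation relies on.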