Text-Independent Cross-Language Voice Conversion for Speech-to-Speech Translation

From HLT@INESC-ID

Revision as of 23:14, 14 November 2006 by David (Talk | contribs)

David Sündermann
David Sündermann received his M.Sc. in EE (with Distinction) from the Dresden University of Technology in 2002 and started his PhD project at RWTH Aachen (Germany) with Hermann Ney. In 2003, he received a PhD Fellowship from Siemens Corporate Technology in Munich (Germany) and relocated to Barcelona (Spain) in 2004, working as a research staff member at the Technical University of Catalonia with Antonio Bonafonte. In 2005, he was a visiting scientist at the University of Southern California in LA (USA) with Shri Narayanan and, later that year, at Columbia University in NYC (USA) with Julia Hirschberg. He has written more than 25 papers and holds a patent on voice conversion.

Date

  • November 17, 2006
  • Location: Room 336

Speaker

  • David Sündermann (Universitat Politècnica de Catalunya)

Abstract

For applications like multi-user speech-to-speech translation, it is helpful to individualize the output voice so that the users' voices remain distinguishable. Ideally, this is done by applying the input speaker's voice characteristics to the output speech.

In general, a speech-to-speech translation system consists of three main modules: speech recognition, text translation, and speech synthesis.

Since the latter, the speech synthesis module, is normally based on a large, manually corrected and carefully tuned speech corpus of a professional speaker, the output voice characteristics are static. This limitation is overcome by a fourth module, the voice conversion unit, which processes the synthesizer's output according to the input voice characteristics.

Due to the nature of speech-to-speech translation, the input and output voices are in different languages, leading to the following two challenges:

  • As opposed to state-of-the-art voice conversion, whose statistical parameter training is based on parallel utterances of both speakers involved (the text-dependent approach), here we have to rely on text-independent parameter training: there is no way to produce parallel utterances in different languages.
  • Most voice conversion techniques estimate conversion functions that depend on the phonetic class, either explicitly (e.g. using CART) or implicitly (e.g. using GMM). With different languages involved, however, we face different phoneme sets, which makes it hard to estimate conversion functions for phonetic units not covered by the other language's phoneme set.
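To make the GMM case concrete, the classic joint-density conversion function maps a source feature vector through a mixture of class-conditional linear transforms weighted by the Gaussian posteriors. The sketch below is a generic illustration of this standard technique, not the specific method of the talk, and all parameter values in the usage are invented:

```python
import numpy as np

def gmm_convert(x, w, mu_x, mu_y, cov_xx, cov_yx):
    """Classic GMM-based conversion: F(x) = sum_i p_i(x) *
    [mu_y_i + cov_yx_i inv(cov_xx_i) (x - mu_x_i)], where p_i(x) is
    the posterior of mixture component i given source vector x."""
    K, d = mu_x.shape
    resp = np.empty(K)
    for i in range(K):
        diff = x - mu_x[i]
        inv = np.linalg.inv(cov_xx[i])
        expo = -0.5 * diff @ inv @ diff
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov_xx[i]))
        resp[i] = w[i] * np.exp(expo) / norm
    resp /= resp.sum()  # posteriors p_i(x)
    y = np.zeros(mu_y.shape[1])
    for i in range(K):
        y += resp[i] * (mu_y[i]
                        + cov_yx[i] @ np.linalg.inv(cov_xx[i]) @ (x - mu_x[i]))
    return y

# With one component, identity covariances, mu_x = 0 and mu_y = 1,
# the function reduces to a simple shift of the source vector.
y = gmm_convert(np.array([2., 3.]), np.array([1.0]),
                np.array([[0., 0.]]), np.array([[1., 1.]]),
                np.array([np.eye(2)]), np.array([np.eye(2)]))
```

Because each mixture component tends to model a region of acoustically similar frames, the conversion function implicitly depends on the phonetic class, which is exactly why mismatched phoneme sets across languages become a problem.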

In this talk, I present text-independent voice conversion techniques that are portable across languages and aim at solving these challenges. In this context, I will

  • introduce a speech alignment technique based on unit selection that deals with non-parallel speech,
  • and show that vocal tract length normalization, which is applied to convert the source voice towards the target, can be applied directly to the time frames without the detour through the frequency domain.
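The two ingredients above can be sketched in a deliberately simplified form. The snippet below is a hedged illustration under my own assumptions, not the speaker's actual algorithms: a nearest-neighbour frame matching stands in for unit-selection-based alignment of non-parallel speech, and a conventional frequency-domain VTLN warp is shown for reference (the talk's point being that an equivalent warp can be applied directly to the time frames):

```python
import numpy as np

def align_frames(src, tgt):
    """Toy text-independent alignment: match each source frame to its
    nearest target frame (Euclidean distance in feature space).
    src: (N, d) source features, tgt: (M, d) target features."""
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # matched target index per source frame

def vtln_frame(frame, alpha):
    """Conventional frequency-domain VTLN for one time frame: linearly
    warp the spectrum by factor alpha (alpha = 1 leaves it unchanged)."""
    spec = np.fft.rfft(frame)
    bins = np.arange(len(spec))
    re = np.interp(bins / alpha, bins, spec.real, right=0.0)
    im = np.interp(bins / alpha, bins, spec.imag, right=0.0)
    return np.fft.irfft(re + 1j * im, n=len(frame))

# Toy usage with two invented 2-dimensional "frames" per speaker:
matches = align_frames(np.array([[0., 0.], [5., 5.]]),
                       np.array([[4.9, 5.1], [0.1, -0.1]]))
warped = vtln_frame(np.sin(np.linspace(0.0, 10.0, 64)), 0.9)
```

The pairs produced by such an alignment can then serve as pseudo-parallel training data in place of the parallel utterances that a text-dependent approach would require.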

The techniques' performance is assessed on several multilingual corpora by means of subjective evaluations. In addition to the evaluation results, speech samples will illustrate the effectiveness of the discussed techniques.