Recovering Capitalization and Punctuation Marks on Speech Transcriptions

From HLT@INESC-ID

Fernando Batista

Date

  • 15:00, Friday, February 13th, 2009
  • Room 4

Speaker

  • Fernando Batista

Abstract

Enormous quantities of digital audio and video data are produced daily by TV stations, radio, and other media organizations. Automatic Speech Recognition (ASR) systems are now being applied to such information sources in order to enrich them with additional knowledge for applications such as indexing, subtitling, translation, and production of multimedia content. Nonetheless, ASR output usually consists of raw text, often in lower case, without any punctuation, with all numbers written as words, and containing different types of disfluencies. The ASR output is still useful for many applications, but tasks such as subtitling and multimedia content production benefit from additional information, which can be provided by rich transcription applications. In general, enriching the speech output aims to improve legibility and to enhance the information available for future human and machine processing.

This thesis investigates and develops methods for producing enriched transcripts that can be applied to real-life speech recognition systems. This presentation describes the work already carried out in the scope of the thesis and points to possible future directions for its completion. The research performed so far focuses on punctuation and capitalization recovery, two important Rich Transcription tasks that are now gaining increasing attention from the scientific community. The study has been performed mainly on broadcast news speech transcriptions, but other sources of information, such as written newspaper corpora, have also been used. Most of the experiments are performed on both manual and automatic speech transcripts, making it possible to assess the impact of recognition errors. The experiments on capitalization and punctuation recovery for spoken texts provide the first evaluation results for these two tasks on Portuguese broadcast news data.