Recover Capitalization and Punctuation Marks on Speech Transcriptions: Difference between revisions

Latest revision as of 14:38, 19 May 2011

Date

14:30, Wednesday, May 25th, 2011
Room 20

Speaker

Fernando Batista

Abstract

This presentation addresses two important metadata annotation tasks involved in the production of rich transcripts: capitalization and the recovery of punctuation marks. The study focuses on broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared; the results indicate that generative approaches capture the structure of written corpora better, while discriminative approaches are suitable for dealing with speech transcripts and are also more robust to ASR errors. So-called language dynamics were also addressed, and the results indicate that capitalization performance is affected by the temporal distance between the training and testing data. For the punctuation task, the study covers the three most frequent marks: full stop, comma, and question mark. Early experiments addressed full-stop and comma recovery, using local features and combining lexical and acoustic information. Recent experiments also combine prosodic information and extend the study to question marks.

Note: This seminar will be held in English.

@@ Line 9: / Line 9: @@
 == Date ==
-* 14:30, Wednesday, May 25<sup>th</sup>, 2010
+* 14:30, Wednesday, May 25<sup>th</sup>, 2011
 * Room 20
@@ Line 21: / Line 21: @@
+'''Note:''' This seminar will be held in English.

From HLT@INESC-ID