This presentation addresses two important metadata annotation tasks involved in the production of rich transcripts: capitalization and the recovery of punctuation marks. The study focuses on broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared; the results indicate that generative approaches capture the structure of written corpora better, while discriminative approaches are suitable for dealing with speech transcripts and are also more robust to ASR errors. The so-called language dynamics have also been addressed, and results indicate that capitalization performance is affected by the temporal distance between the training and testing data. For the punctuation task, this study covers the three most frequent marks: full stop, comma, and question mark. Early experiments addressed full-stop and comma recovery using local features and a combination of lexical and acoustic information. More recent experiments also incorporate prosodic information and extend the study to question marks.
Note: This seminar will be held in English.