The aim of our study is twofold: to quantify the distinct interrogative types in different contexts and to discuss the weight of the linguistic features that best describe these structures, in order to model interrogatives in speech.
In European Portuguese, as in other languages, interrogatives may be subclassified into yes-no questions, wh-questions and tags. These distinctions are accompanied by lexico-syntactic and prosodic features. Yes-no questions have the same syntactic structure as a declarative and may be differentiated by intonation contours; wh-questions contain wh-words that make them recognizable; and tags also have distinctive forms, e.g., declarative sentence + negative particle + verb.
State-of-the-art studies on sentence boundary detection and punctuation have discussed the relative weight of the previously mentioned features. Shriberg et al. (1998; 2008) report that prosodic features are more significant than lexical ones and that better results are achieved when both feature types are combined; Wang and Narayanan (2004) claim that results based only on prosodic properties are quite robust; Boakye et al. (2009), analyzing meetings, state that lexico-syntactic features are the most important ones for identifying interrogatives. These divergent findings raise the following question: does the weight of the features depend on the nature of the corpus and on the most characteristic types of interrogative in each?
This study addresses that question, using three distinct corpora of European Portuguese: broadcast news (61h, 449k words), classroom lectures (27h, 155k words), and map-task dialogues (7h, 61k words). Results are also presented for newspaper text (148M words).