CORAL Corpus
From HLT@INESC-ID
The CORAL corpus was collected within a national project sponsored by the PRAXIS XXI program and carried out by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of the project is to collect a spoken dialogue corpus with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic.
Linguistic Contents
The corpus contains 64 dialogues about a predetermined subject: maps. One participant (the giver) has a map with some landmarks and a route drawn between them; the other (the follower) also has landmarks but no route, and must therefore reconstruct it. To elicit conversation, there are small differences between the two maps: one of the landmarks is duplicated in one map and single in the other; some landmarks are present in only one of the maps; and some have slightly different names in the two maps (e.g. curvas perigosas vs. troço sinuoso). In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena:
- Sequences with /l/ in contexts that do or do not favour its velarization (e.g. sala malva, sal amargo)
- Sequences with /s/ in word final position followed by another coronal fricative (e.g. barcos salva-vidas)
- Sequences of plosives formed across word boundaries (e.g. clube de tiro)
- Sequences of obstruents formed within and across word boundaries (e.g. bairros degradados)
The last three items were designed to allow a more comprehensive study of consonant clusters formed within and across word boundaries and should, therefore, be jointly investigated.
Number and Type of Speakers
The 32 speakers were divided into 8 quartets, and each quartet took part in 8 dialogues. Given the small number of speakers, they were chosen to achieve an adequate balance of sexes, but were restricted in terms of age (undergraduate or graduate students) and accent (Lisbon area). Speakers were recruited in pairs of people who know each other, so that half of the conversations take place between "friends" and half between people who do not know each other.
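The figures in this section can be cross-checked with a few lines of Python. This is only a minimal sketch: the speaker labels are hypothetical, and the assumption that each quartet consists of exactly two pairs of friends is an inference from the description, not something the source states explicitly.

```python
from itertools import combinations

# Figures stated in the corpus description.
NUM_SPEAKERS = 32
NUM_QUARTETS = 8
DIALOGUES_PER_QUARTET = 8

speakers_per_quartet = NUM_SPEAKERS // NUM_QUARTETS      # 4
total_dialogues = NUM_QUARTETS * DIALOGUES_PER_QUARTET   # 64
friend_dialogues = total_dialogues // 2                  # 32 (half between "friends")

# ASSUMPTION: a quartet is made of two pairs of friends.
# The labels A1, A2, B1, B2 are hypothetical, for illustration only.
quartet = {"A1": "pair_A", "A2": "pair_A", "B1": "pair_B", "B2": "pair_B"}

# All unordered speaker pairings inside one quartet.
pairings = list(combinations(sorted(quartet), 2))
friend_pairings = [(a, b) for a, b in pairings if quartet[a] == quartet[b]]
stranger_pairings = [(a, b) for a, b in pairings if quartet[a] != quartet[b]]

print("speakers per quartet:", speakers_per_quartet)
print("total dialogues:", total_dialogues)
print("friend pairings:", friend_pairings)
print("stranger pairings:", stranger_pairings)
```

Under this assumption a quartet offers two friend pairings and four stranger pairings, so the two friend pairings must account for four of the eight dialogues between them. How giver and follower roles or maps were assigned to each dialogue is not described here.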
Data Collection
The recordings take place in a soundproof room, with no visual contact between the speakers. The speakers wear close-talking microphones, and the recordings are made in stereo directly to DAT and later down-sampled to 16 kHz per channel. Recording levels are adjusted beforehand, and no monitoring is done once the dialogues start.
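The down-sampling step can be illustrated as follows. This is a sketch under stated assumptions, not the procedure actually used for the corpus: the original DAT sampling rate is not given in the description, so 48 kHz is assumed here (which makes the conversion a decimation by 3), and the filter length and cutoff are illustrative choices. The helper names `lowpass_taps` and `decimate` are hypothetical. Each stereo channel would be processed separately.

```python
import math

INPUT_RATE = 48000    # ASSUMPTION: DAT rate is not stated in the description
OUTPUT_RATE = 16000   # stated target rate per channel
FACTOR = INPUT_RATE // OUTPUT_RATE  # 3

def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass filter.

    cutoff is a fraction of the input sampling rate (0 < cutoff < 0.5).
    The taps are normalized to unity gain at DC.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def decimate(samples, factor, taps):
    """Low-pass filter one channel, then keep every `factor`-th sample.

    The filter is centred on each output sample; samples beyond the
    ends of the input are treated as zero.
    """
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out

# Cutoff slightly below the new Nyquist frequency (8 kHz) to limit aliasing.
taps = lowpass_taps(7000 / INPUT_RATE)

# Example: 10 ms of a 1 kHz tone on one channel at the assumed input rate.
channel = [math.sin(2 * math.pi * 1000 * n / INPUT_RATE) for n in range(480)]
resampled = decimate(channel, FACTOR, taps)
print(len(channel), "->", len(resampled), "samples")
```

Filtering before discarding samples matters because any energy above 8 kHz in the original recording would otherwise fold back into the 16 kHz signal. In practice a tested resampling routine from an audio or signal-processing library would normally be preferred over a hand-written filter.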
Annotation
Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.