CORAL Corpus

From HLT@INESC-ID

Latest revision as of 10:05, 26 April 2013

The CORAL corpus was collected in the framework of a national project sponsored by the PRAXIS XXI program, by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of the project was to collect a spoken dialogue corpus with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic.

Linguistic Contents

Figures: example map for the giver; example map for the follower.

The corpus comprises 64 dialogues about a predetermined subject: maps. One of the participants (the giver) has a map with some landmarks and a route drawn between them; the other (the follower) also has landmarks but no route, and must therefore reconstruct it. To elicit conversation, there are small differences between the two maps: one of the landmarks appears twice in one map and only once in the other; some landmarks are present in only one of the maps; and some have slightly different names in the two maps (e.g. curvas perigosas vs. troço sinuoso). In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena:

  • Sequences with /l/ in contexts that do or do not favour its velarization (e.g. sala malva, sal amargo)
  • Sequences with /s/ in word-final position followed by another coronal fricative (e.g. barcos salva-vidas)
  • Sequences of plosives formed across word boundaries (e.g. clube de tiro)
  • Sequences of obstruents formed within and across word boundaries (e.g. bairros degradados)

The last three items were designed to allow a more comprehensive study of consonant clusters formed within and across word boundaries and should, therefore, be jointly investigated.
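
The three kinds of map mismatch described above (a landmark that appears a different number of times, a landmark present in only one map, and a landmark with a different name) can be illustrated with a small comparison routine. This is only a sketch: the function name map_differences is hypothetical, and although the landmark names come from the examples in the text, their assignment to a giver or follower map is invented for illustration.

```python
from collections import Counter

def map_differences(giver, follower):
    """Compare the landmark name lists of the two maps used in one dialogue."""
    g, f = Counter(giver), Counter(follower)
    return {
        # names found in exactly one of the two maps
        "only_giver": sorted(set(g) - set(f)),
        "only_follower": sorted(set(f) - set(g)),
        # names found in both maps, but a different number of times
        "count_mismatch": sorted(n for n in set(g) & set(f) if g[n] != f[n]),
    }

# Hypothetical landmark lists (the map assignment is an assumption).
giver = ["sala malva", "clube de tiro", "curvas perigosas", "barcos salva-vidas"]
follower = ["sala malva", "sala malva", "clube de tiro",
            "troço sinuoso", "bairros degradados"]

diff = map_differences(giver, follower)
for kind, names in diff.items():
    print(kind, names)
```

A renamed landmark (curvas perigosas vs. troço sinuoso) shows up here as one name missing from each map; telling a rename apart from a genuinely missing landmark would need extra information, such as landmark positions, which this sketch does not model.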

Number and Type of Speakers

The 32 speakers were divided into 8 quartets, and each quartet took part in 8 dialogues, giving the 64 dialogues of the corpus. Given the small number of speakers, they were chosen to achieve an adequate balance of sexes, but were restricted in terms of age (undergraduate or graduate students) and accent (Lisbon area). Speakers were recruited in pairs of people who knew each other, so that half of the conversations took place between "friends" and half between people who did not know each other.
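
The figures in this section can be checked with a short enumeration. The sketch below assumes that each quartet consists of two pairs of friends, and it picks one possible pairing scheme that yields 8 dialogues per quartet with half between friends. The page does not describe the actual pairing used, so quartet_schedule and the speaker labels are hypothetical.

```python
def quartet_schedule(a1, a2, b1, b2):
    """Return 8 (giver, follower, relation) dialogues for one quartet.

    Assumes (a1, a2) and (b1, b2) are the two pairs of friends.
    """
    # Each pair of friends talks twice, swapping roles.
    friends = [(a1, a2), (a2, a1), (b1, b2), (b2, b1)]
    # Four cross-pair dialogues, chosen so that every speaker is
    # giver once and follower once among strangers.
    strangers = [(a1, b1), (b1, a2), (a2, b2), (b2, a1)]
    return ([(g, f, "friends") for g, f in friends]
            + [(g, f, "strangers") for g, f in strangers])

N_QUARTETS = 8
schedule = []
for q in range(N_QUARTETS):
    speakers = [f"q{q}s{i}" for i in range(4)]  # hypothetical speaker IDs
    schedule.extend(quartet_schedule(*speakers))

n_speakers = len({s for g, f, _ in schedule for s in (g, f)})
n_friends = sum(1 for _, _, rel in schedule if rel == "friends")
print(len(schedule), n_speakers, n_friends)  # 64 32 32
```

Under this assumed scheme every speaker takes part in four dialogues, twice as giver and twice as follower.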

Data Collection

The recordings took place in a soundproof room, with no visual contact between the speakers. The speakers wore close-talking microphones, and the recordings were made in stereo directly to DAT and later down-sampled to 16 kHz per channel. Recording levels were adjusted beforehand, and no monitoring was done once a dialogue had started.
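
The down-sampling step can be sketched as low-pass filtering followed by decimation. The page does not state the original DAT sampling rate or the tool that was used; the example assumes a 48 kHz source, so that 16 kHz is reached by keeping every third filtered sample. The functions lowpass_fir and decimate are illustrative, not the project's actual processing chain.

```python
import math

def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter.

    cutoff is a fraction of the input sampling rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at 0 Hz

def decimate(samples, factor, num_taps=127):
    """Low-pass filter one channel, then keep every factor-th output sample."""
    # Cutoff slightly below the new Nyquist frequency (0.5 / factor).
    taps = lowpass_fir(num_taps, 0.45 / factor)
    out = []
    for start in range(0, len(samples) - num_taps + 1, factor):
        out.append(sum(t * samples[start + k] for k, t in enumerate(taps)))
    return out

# Assumed source rate: 48 kHz, so factor 3 gives 16 kHz.
SOURCE_RATE = 48000
tone = [math.sin(2 * math.pi * 1000 * n / SOURCE_RATE) for n in range(4800)]
tone_16k = decimate(tone, 3)
```

For a stereo recording, each channel would be passed through decimate separately. The filter is needed because content above 8 kHz would otherwise alias into the 16 kHz signal.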

Annotation

Only orthographic transcription (http://www.l2f.inesc-id.pt/resources/coral/ortograf.html) was done for the whole corpus. A pilot recording was annotated at several levels.