BDFALA Corpus

From HLT@INESC-ID

Latest revision as of 13:07, 3 June 2006

The BDFALA corpus was jointly developed by INESC and CLUL within the framework of a national project sponsored by JNICT (Lusitânia Program).

Goal: to extend the EUROM.1 corpus, mainly in order to improve speech synthesis systems.

Linguistic Contents

Six types of corpus material were collected:

  • ~4600 isolated words
  • 350 sentences for prosodic studies
  • 18 phonetically-complete paragraphs
  • 60 read paragraphs extracted from television debates
  • ~3000 logatomes
  • 600 phonetically rich sentences

Number and Type of Speakers

The 8 speakers were selected to achieve a balance in terms of sex and age group and, as far as possible, were chosen from among the speakers of the EUROM.1 corpus. The last two corpus types listed above were spoken by only one male and one female speaker. A subset was also read by two young speakers (one male and one female), 12-14 years old, who had also been recorded in EUROM.1.

Data Collection

Data collection took place in a soundproof room. Two recording modes were adopted. In the first mode, used for isolated words and logatomes, the material was read from paper and recorded directly to DAT; the speech material was then semi-automatically segmented and validated a posteriori. In the second mode, used for the remaining material (sentences and paragraphs), a self-monitoring program recorded directly to disk. The recordings were calibrated in both cases. The sampling frequency was 16 kHz.

Annotation

For each spoken item, the corresponding orthographic script is saved in a separate ASCII file. A pronunciation lexicon with citation-form phonemic transcriptions for each word is also included. The transcriptions were produced automatically and hand-corrected a posteriori.

Packaging

The corpus material amounts to around 2.4 GB and is stored on 4 CD-ROMs.
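
As a rough sanity check on these figures, the sketch below estimates an upper bound on the amount of recorded speech that would fit in the stated corpus size at the stated 16 kHz sampling frequency. The sample width (16-bit linear PCM), the channel count (mono), and the reading of "2.4 GB" as decimal gigabytes are assumptions made for illustration only; none of them is stated in this description, and the function name `audio_hours` is hypothetical.

```python
# Figures taken from the corpus description.
SAMPLE_RATE_HZ = 16_000   # "The sampling frequency was 16 kHz"
CORPUS_BYTES = 2.4e9      # "around 2.4 GB" (ASSUMPTION: decimal gigabytes)

# ASSUMPTIONS, not stated in the description.
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # mono recordings


def audio_hours(total_bytes, sample_rate_hz=SAMPLE_RATE_HZ,
                bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the hours of uncompressed audio that fit in total_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # Upper bound: ignores the orthographic files, lexicon and file headers.
    print(f"At most about {audio_hours(CORPUS_BYTES):.1f} hours of speech")
```

Under these assumptions the corpus size corresponds to at most about 21 hours of uncompressed speech. The real figure would be lower, since the orthographic files and the pronunciation lexicon also take up space, and it would change if the recordings use a different sample width or channel count.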