BDFALA Corpus

From HLT@INESC-ID

Revision as of 03:21, 13 February 2006

The BDFALA corpus was jointly developed by INESC and CLUL in the framework of the national project sponsored by JNICT (Program Lusitânia).

Goal: enlargement of the EUROM.1 corpus, mainly for the improvement of speech synthesis systems.

Linguistic Contents

Six types of corpus material were collected:

  • ~4600 isolated words
  • 350 sentences for prosodic studies
  • 18 phonetically-complete paragraphs
  • 60 read paragraphs extracted from television debates
  • ~3000 logatomes
  • 600 phonetically rich sentences

Number and Type of Speakers

The 8 speakers were selected to balance sex and age group and, as far as possible, to overlap with the speakers of the EUROM.1 corpus. The last two corpus types were spoken by only one male and one female speaker. A subset was also read by two young speakers (one male and one female), 12-14 years old, who were also recorded in EUROM.1.

Data Collection

Data collection took place in a sound-proof room, using two recording modes. In the first mode (isolated words and logatomes), the material was read from paper and recorded directly to DAT; the speech was then semi-automatically segmented and validated. In the second mode (the remaining sentences and paragraphs), a self-monitoring program recorded directly to disk. The recordings were calibrated in both cases. The sampling frequency was 16 kHz.

Annotation

For each spoken item, the corresponding orthographic script is saved in a separate ASCII file. A pronunciation lexicon with citation-form phonemic transcriptions for each word is also included. The transcriptions were produced automatically and then corrected by hand.

Packaging

The corpus material amounts to around 2.4 GB and is stored on 4 CD-ROMs.
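
As a rough sanity check on these figures, the stated size and sampling frequency can be turned into an approximate recording duration. The sketch below is only an estimate under assumptions the description does not state: 16-bit linear PCM, a single channel, 1 GB taken as 10^9 bytes, and all of the space being used by audio. The constant and function names are illustrative.

```python
# Rough estimate of how much audio fits in the stated corpus size.
# From the text: 16 kHz sampling frequency, about 2.4 GB in total.
# ASSUMPTIONS (not stated in the text): 16-bit linear PCM, mono,
# 1 GB = 10**9 bytes, and all of the space is taken up by audio.

SAMPLE_RATE_HZ = 16_000   # from the corpus description
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples
CHANNELS = 1              # assumption: mono recordings
CORPUS_BYTES = 2.4e9      # "around 2.4 GB", with 1 GB = 10**9 bytes


def bytes_per_second(sample_rate_hz, bytes_per_sample, channels):
    """Storage needed for one second of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels


def estimated_hours(total_bytes, rate_bytes_per_s):
    """Hours of audio that fit in total_bytes at the given byte rate."""
    return total_bytes / rate_bytes_per_s / 3600


rate = bytes_per_second(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
hours = estimated_hours(CORPUS_BYTES, rate)
print(f"{rate} bytes/s, roughly {hours:.1f} hours of audio")
# prints "32000 bytes/s, roughly 20.8 hours of audio"
```

Under these assumptions the result is an upper bound, since the transcription files and the lexicon also occupy some of the space; a different sample width or channel count would scale the estimate proportionally.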