BDFALA Corpus

The BDFALA corpus was jointly developed by INESC and CLUL within the framework of a national project sponsored by JNICT (Program Lusitânia).

Goal: to extend the EUROM.1 corpus, mainly to improve speech synthesis systems.

Linguistic Contents

Six types of corpus material were collected:

~4600 isolated words
350 sentences for prosodic studies
18 phonetically-complete paragraphs
60 read paragraphs extracted from television debates
~3000 logatomes
600 phonetically rich sentences

Number and Type of Speakers

The 8 speakers were selected to achieve a balance in terms of sex and age group and, as far as possible, were chosen from among the speakers of the EUROM.1 corpus. The last two corpus types listed above were spoken by only one male and one female speaker. A subset was also read by two young speakers (one male and one female), 12-14 years old, who were also recorded in EUROM.1.

Data Collection

Data collection took place in a soundproof room. Two recording modes were adopted. In the first mode (isolated words and logatomes), the material was read from paper and recorded directly to DAT; the speech material was then semi-automatically segmented and validated. In the second mode (the remaining sentences and paragraphs), a self-monitoring program was used which recorded directly to disk. The recordings were calibrated in both cases. The sampling frequency was 16 kHz.

Annotation

For each spoken item, the corresponding orthographic script is saved in a separate ASCII file. A pronunciation lexicon with citation-form phonemic transcriptions for each word is also included. The transcriptions were produced automatically and then corrected by hand.

Packaging

The corpus material amounts to around 2.4 GB and is stored on 4 CD-ROMs.
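As a rough sanity check on these figures, the sketch below estimates how much audio the stated size could hold and how the data would divide across the discs. Only the 16 kHz sampling frequency, the total of around 2.4 GB, and the 4 discs come from the description; the 16-bit sample width, the single channel, and the decimal reading of "GB" are assumptions made for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of corpus duration and per-disc load.
# From the description: 16 kHz sampling, about 2.4 GB in total, 4 CD-ROMs.
# Assumptions (not stated in the description): 16-bit linear PCM, mono,
# and "GB" read as 10**9 bytes.

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples
CHANNELS = 1              # assumption: single-channel recordings
CORPUS_BYTES = 2.4e9      # "around 2.4 GB"
NUM_DISCS = 4


def audio_hours(total_bytes, sample_rate_hz=SAMPLE_RATE_HZ,
                bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Hours of uncompressed audio that fit in total_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


def bytes_per_disc(total_bytes, discs=NUM_DISCS):
    """Average number of bytes per disc if the data is split evenly."""
    return total_bytes / discs


if __name__ == "__main__":
    print(f"Upper bound on audio: {audio_hours(CORPUS_BYTES):.1f} hours")
    print(f"Average per disc: {bytes_per_disc(CORPUS_BYTES) / 1e6:.0f} MB")
```

Under these assumptions the result is an upper bound, since the orthographic scripts and the pronunciation lexicon also occupy part of the total.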

From HLT@INESC-ID

Revision as of 03:20, 13 February 2006 by Root
