EUROM.1 Corpus

The EUROM.1 corpus for European Portuguese was collected jointly by INESC and CLUL within the SAM_A European project. SAM_A extended an earlier project, SAM (Speech Assessment Methods), which began planning a multilingual resource for the Spoken Language Engineering needs of the European Union. Although the corpus is mainly used for recognition and synthesis research, our group has also used it for phonetic coding research.

Linguistic Contents

For each of the 11 languages covered by this project, 4 types of corpus material were collected:

CVC material (totalling 121 different logatomes) in isolation and in context (5 carrier phrases)
100 selected numbers from 0-9999
40 short passages each containing 5 thematically connected sentences (half of the passages were freely translated from the English version of EUROM.1; most of the remaining ones were adapted from Portuguese books and newspapers)
50 filler sentences to compensate for the phoneme-frequency imbalance in the passages

Number and Type of Speakers

The corpus was structured into 3 target subsets:

Many Talker Corpus (30 male + 30 female speakers): 100 numbers, 3 passages and 5 sentences
Few Talker Corpus (5 male + 5 female speakers, selected from the Many Talker Corpus): 5 x CVCs, 5 x 100 numbers, 15 passages and 25 sentences
Very Few Talker Corpus (1 male + 1 female speaker, selected from the Few Talker Corpus): CVC in context

The speakers were selected to cover a wide range of age groups and normal voice types. One main accent group was selected (Lisbon area), together with a small number of speakers from other accent regions.
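The nesting of the three subsets can be made explicit with a small sketch. The class and function names below are illustrative assumptions, not part of the EUROM.1 distribution; only the speaker counts come from the description above. Because each smaller subset is drawn from the next larger one, the number of distinct speakers is that of the largest subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One EUROM.1 target subset (hypothetical helper type)."""
    name: str
    male: int
    female: int
    material: str

    @property
    def speakers(self) -> int:
        return self.male + self.female


# Speaker counts and material as listed in the corpus description.
SUBSETS = [
    Subset("Many Talker", 30, 30, "100 numbers, 3 passages, 5 sentences"),
    Subset("Few Talker", 5, 5, "5 x CVCs, 5 x 100 numbers, 15 passages, 25 sentences"),
    Subset("Very Few Talker", 1, 1, "CVC in context"),
]


def speaker_counts(subsets):
    """Map each subset name to its number of speakers."""
    return {s.name: s.speakers for s in subsets}


def distinct_speakers(subsets):
    """Subsets are nested, so the largest one contains every speaker."""
    return max(s.speakers for s in subsets)


if __name__ == "__main__":
    for name, count in speaker_counts(SUBSETS).items():
        print(f"{name}: {count} speakers")
    print("Distinct speakers:", distinct_speakers(SUBSETS))
```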

Data Collection

The recordings were made in an anechoic chamber using a high-quality microphone, and were stored both directly to disc (using an A/D board) and on DAT tape. The EUROPEC program was used to prompt speakers, displaying each item to be read on the computer screen. The sampling frequency was 20 kHz. Calibration also followed the SAM recommendations, and the recording sessions were carefully monitored.

Annotation

The label files were produced in the format defined by the SAM project. Besides the orthographic transcription, they include information about the signal file and the recording session, among other items.

Packaging

The corpus is distributed on 5 CD-ROMs and totals 2.6 GB.
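As a rough sanity check, the stated corpus size can be related to the 20 kHz sampling frequency. The sketch below assumes 16-bit mono samples and a decimal gigabyte; neither is stated in this description, so the result is only an order-of-magnitude upper bound, since the discs also hold label files.

```python
SAMPLE_RATE_HZ = 20_000   # sampling frequency given in Data Collection
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 16-bit samples, single channel
CORPUS_BYTES = 2.6e9      # ASSUMPTION: 2.6 GB read as decimal gigabytes


def bytes_per_second(rate_hz=SAMPLE_RATE_HZ, sample_bytes=BYTES_PER_SAMPLE):
    """Raw audio data rate for uncompressed mono samples."""
    return rate_hz * sample_bytes


def hours_of_audio(total_bytes, rate_hz=SAMPLE_RATE_HZ, sample_bytes=BYTES_PER_SAMPLE):
    """Upper bound on recording time if every byte were audio."""
    return total_bytes / bytes_per_second(rate_hz, sample_bytes) / 3600


if __name__ == "__main__":
    print(f"Data rate: {bytes_per_second()} bytes/s")
    print(f"At most about {hours_of_audio(CORPUS_BYTES):.1f} hours of audio")
```

Changing `BYTES_PER_SAMPLE` or `CORPUS_BYTES` shows how sensitive the estimate is to these assumptions.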

From HLT@INESC-ID

Revision as of 13:26, 3 June 2006 by David (talk | contribs)