EUROM.1 Corpus


Revision as of 02:33, 13 February 2006 by Root

The EUROM.1 corpus for European Portuguese was collected jointly by INESC and CLUL within the SAM_A European project. SAM_A extended an earlier project, SAM (Speech Assessment Methods), which began the planning of a multilingual resource for the Spoken Language Engineering needs of the European Union. Although the corpus is used mainly for recognition and synthesis research, our group has also used it for phonetic coding research.

Linguistic Contents

For each of the 11 languages covered by this project, 4 types of corpus material were collected:

  • CVC material (totalling 121 different logatomes) in isolation and in context (5 carrier phrases)
  • 100 selected numbers in the range 0-9999
  • 40 short passages each containing 5 thematically connected sentences (half of the passages were freely translated from the English version of EUROM.1; most of the remaining ones were adapted from Portuguese books and newspapers)
  • 50 filler sentences to compensate for the phoneme-frequency imbalance in the passages

Number and Type of Speakers

The corpus was structured into 3 target subsets:

  • Many Talker Corpus (30 male + 30 female speakers): 100 numbers, 3 passages, 5 sentences
  • Few Talker Corpus (5 male + 5 female, selected from MANY): 5 x CVCs, 5 x 100 numbers, 15 passages, 25 sentences
  • Very Few Talker Corpus (1 male + 1 female, selected from FEW): CVC in context
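
The subset definitions above can be sketched as a small data structure for tallying read items. This is only an illustration: the class and field names are hypothetical, the listed quantities are assumed to be per speaker, and CVC material is left out because the number of CVC items per repetition is not stated for these subsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One EUROM.1 target subset (hypothetical structure, for illustration only)."""
    name: str
    male: int
    female: int
    numbers: int    # number items read per speaker (assumed per-speaker)
    passages: int   # passages read per speaker (assumed per-speaker)
    sentences: int  # filler sentences read per speaker (assumed per-speaker)

    @property
    def speakers(self) -> int:
        return self.male + self.female

    def items_per_speaker(self) -> int:
        # CVC material is deliberately excluded (count not stated in the text).
        return self.numbers + self.passages + self.sentences

    def total_items(self) -> int:
        return self.speakers * self.items_per_speaker()


# Figures taken from the subset list; "5 x 100 numbers" is read as 500 number items.
MANY = Subset("Many Talker", male=30, female=30, numbers=100, passages=3, sentences=5)
FEW = Subset("Few Talker", male=5, female=5, numbers=5 * 100, passages=15, sentences=25)

for subset in (MANY, FEW):
    print(subset.name, subset.speakers, subset.items_per_speaker(), subset.total_items())
# Many Talker 60 108 6480
# Few Talker 10 540 5400
```

Under these assumptions, the Many Talker subset has more speakers but fewer items per speaker, while the Few Talker subset trades speaker coverage for depth per speaker.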

The speakers were selected to cover a wide range of age groups and normal voice types. One main accent group was selected (Lisbon area), together with a small number of speakers from other accent regions.

Data Collection

The recordings were made in an anechoic chamber using a high-quality microphone, and were stored both directly to disc (via an A/D board) and on DAT tape. The EUROPEC program was used to prompt the speaker, on the computer screen, with the items to be read. The sampling frequency was 20 kHz. Calibration followed the SAM recommendations, and the recordings were carefully monitored.


The SAM project defined the format of the label files which were produced. Besides the orthographic transcription, these included information about the signal file and the recording session, among other items.


The corpus is contained on 5 CD-ROMs and totals 2.6 GB.
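
As a rough sanity check, the stated size and the 20 kHz sampling frequency give an upper bound on the amount of recorded speech. The sketch below rests on assumptions not stated in the text: 16-bit mono samples, the total read as 2.6 decimal gigabytes, and all of that space occupied by uncompressed audio (label files and headers would lower the real figure). The function name is hypothetical.

```python
SAMPLE_RATE_HZ = 20_000        # stated in the text
BYTES_PER_SAMPLE = 2           # assumption: 16-bit samples
CHANNELS = 1                   # assumption: mono recordings
TOTAL_BYTES = 2_600_000_000    # assumption: 2.6 decimal gigabytes, all audio


def max_audio_hours(total_bytes: int, sample_rate_hz: int,
                    bytes_per_sample: int, channels: int = 1) -> float:
    """Upper bound on audio duration in hours for uncompressed PCM data."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


hours = max_audio_hours(TOTAL_BYTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"At most about {hours:.1f} hours of audio")
```

With these assumptions the bound comes out at roughly 18 hours; a different sample width or channel count would scale it proportionally.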