SPEECHDAT Corpus

From HLT@INESC-ID

The SPEECHDAT corpus collection for European Portuguese was divided into 2 phases: the collection of 1000 telephone calls (preparatory MLAP project SPEECHDAT-M) and the collection of 4000 telephone calls (Language Engineering project SPEECHDAT-II). The project covers databases for all official languages of the EU and some major dialectal variants. The work was done by INESC under a subcontract with Portugal Telecom.

Goal: a realistic corpus for training and assessing the recognition of isolated and continuous speech utterances (whole-word or subword approaches), which can be used to develop voice-driven teleservices.

Linguistic Contents

In the second phase, each speaker is asked to answer 7 spontaneous questions, some of them related to demographic information (e.g. date and place of birth), and to read a prompt sheet with 33 items. 4000 different prompt sheets were produced. The material for each speaker comprises (see example of prompt sheet):

  • 3 application words (chosen from a vocabulary of 30)
  • 1 sequence of isolated digits
  • 4 connected digit strings (prompt sheet number, telephone number, credit card number, PIN code)
  • 1 word spotting phrase
  • 1 isolated digit, 1 natural number and 1 currency amount
  • 3 spelled words (1 spontaneous (forename) + 2 read)
  • 5 directory assistance names (2 spontaneous (forename and place of birth) + 3 read)
  • 2 questions (predominantly yes and no, but also fuzzy answers)
  • 3 dates (1 spontaneous (date of birth) + 2 read)
  • 2 time phrases (1 spontaneous (time of day) + 1 read)
  • 4 phonetically rich words (chosen from a set of 4000)
  • 9 phonetically rich sentences (chosen from a set of 3600)
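
As a quick consistency check, the per-speaker material listed above can be tallied against the 7 spontaneous questions and 33 read items mentioned earlier. The sketch below is only an illustration: it assumes that the 2 yes/no questions count as spontaneous and that every item not marked as spontaneous is read from the prompt sheet. The names `ITEMS` and `tally` are hypothetical.

```python
# Each entry: (description, spontaneous count, read count).
# Assumption: the 2 yes/no questions are spontaneous; everything not
# explicitly marked "spontaneous" in the list is read from the sheet.
ITEMS = [
    ("application words", 0, 3),
    ("sequence of isolated digits", 0, 1),
    ("connected digits", 0, 4),
    ("word spotting phrase", 0, 1),
    ("isolated digit, natural number, currency amount", 0, 3),
    ("spelled words", 1, 2),
    ("directory assistance names", 2, 3),
    ("questions", 2, 0),
    ("dates", 1, 2),
    ("time phrases", 1, 1),
    ("phonetically rich words", 0, 4),
    ("phonetically rich sentences", 0, 9),
]


def tally(items):
    """Return (spontaneous, read, total) item counts per speaker."""
    spontaneous = sum(s for _, s, _ in items)
    read = sum(r for _, _, r in items)
    return spontaneous, read, spontaneous + read


if __name__ == "__main__":
    spontaneous, read, total = tally(ITEMS)
    print(f"spontaneous={spontaneous} read={read} total={total}")
```

Under these assumptions the tally gives 7 spontaneous and 33 read items, 40 in total per speaker.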

Number and Type of Speakers

Speakers are selected among employees of Portugal Telecom and their relatives and friends, achieving a broad regional coverage. Each of the 3 main age groups considered (16-30, 31-45 and 46-60) accounts for more than 20% of the speakers. Gender distribution is close to ideal (47% male and 53% female).

Data Collection

The design of the collection platform (PC with 2 Dialogic boards) and the data collection itself are the responsibility of INESCTEL. PCM A-law format was adopted.
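
Because the signal files use A-law PCM, each sample is stored as a single byte that must be expanded to linear PCM before most signal processing. The following is a minimal sketch of the standard G.711 A-law expansion. The function names are illustrative, and nothing is assumed here about the sample rate or about the file layout of the corpus.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                      # undo the alternate-bit inversion of the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a buffer of A-law bytes into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]
```

A typical use would be to read the raw bytes of a signal file and pass them to `decode_alaw`, after skipping any header the file may have.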

Annotation

Each speech file has an ASCII SAM label file with information about the calling session, recording conditions, speaker sex, age and accent, signal file, recording date and time, assessment codes and the label file body itself. The body includes the prompting script and the orthographic transcription. A pronunciation lexicon with citation phonemic transcriptions for each word is also produced.
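
Label files of this kind are easy to read programmatically. The sketch below assumes that each line of a label file consists of a short mnemonic, a colon and a value, and that a mnemonic may occur more than once. The function name `parse_sam_label`, the sample record and the mnemonics used in it are illustrative placeholders and are not taken from the corpus.

```python
def parse_sam_label(text: str) -> dict[str, list[str]]:
    """Collect 'MNEMONIC: value' lines into a dict of value lists."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values such as times stay intact.
        key, _, value = line.partition(":")
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Hypothetical record, for illustration only.
SAMPLE = """\
SEX: F
AGE: 34
RET: 10:32:05
LBO: 0,16000,32000,hypothetical transcription
"""

if __name__ == "__main__":
    for key, values in parse_sam_label(SAMPLE).items():
        print(key, values)
```

Keeping a list of values per mnemonic avoids silently dropping repeated fields.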

Packaging

The corpus material for the first phase is stored on 3 CD-ROMs (one for the phonetically rich sentences), using compressed signal files. The first 1000 speakers of the second phase are also stored on 3 CD-ROMs, but with uncompressed signal files.

See Also