ALERT Corpus

The ALERT corpus was collected in the framework of the European project with the same name, with the goal of gathering material for training and evaluating several components of the ALERT media watch system for European Portuguese.

RTP, as the Portuguese data provider in this project, was responsible for collecting the data at its premises. INESC ID was responsible for defining a schedule for the recordings, helping to train the annotators, verifying the annotations, and packaging the data. 4VDO was responsible for defining and setting up the recording conditions. The orthographic transcription was done jointly by RTP and INESC ID, using the Transcriber tool and following the LDC Hub4 (Broadcast Speech) transcription conventions.
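
Transcriber stores each transcription as an XML file (usually with a .trs extension) in which speaker turns are divided into time-stamped segments. As a rough illustration of how such annotations can be consumed downstream, the Python sketch below extracts (start time, speaker, text) segments from one of these files; the element and attribute names follow the usual Transcriber DTD, but the file name is only a placeholder and this is not a tool from the project itself.

    import xml.etree.ElementTree as ET

    def read_trs(path):
        """Yield (start_time, speaker, text) segments from a Transcriber .trs file."""
        tree = ET.parse(path)
        for turn in tree.iter("Turn"):
            speaker = turn.get("speaker", "")
            for sync in turn.iter("Sync"):
                start = float(sync.get("time", turn.get("startTime", "0")))
                # In the Transcriber DTD the transcribed words follow the <Sync/>
                # marker, so they end up in its XML "tail".
                text = (sync.tail or "").strip()
                if text:
                    yield start, speaker, text

    if __name__ == "__main__":
        for start, speaker, text in read_trs("example_program.trs"):
            print(f"{start:8.2f}  {speaker:>6}  {text}")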

The corpus has 3 main parts:

  • Speech Recognition Corpus (SRC) The main goal of this corpus was the training of the acoustic models and the adaptation of the language models used in the large vocabulary speech recognition component of our system.
    The Speech Recognition Corpus was collected from November 2000 through January 2001, including 122 programs of different types and schedules and amounting to 76 hours of audio data. The training data was recorded during October and November of 2000 (61 hours), the development data during one week in December (8 hours), and the evaluation data during one week in January (6 hours).
    The orthographic transcriptions of this corpus were first produced automatically and later manually verified.
  • Topic Detection Corpus (TDC) The main goal of this corpus was to obtain a broader coverage of topics, together with the associated topic classification, for training our topic indexation module.
    The Topic Detection Corpus contains data related to 133 TV broadcasts of the 8 o'clock evening news program. It comprises close to 300 hours of recordings, collected on a daily basis over a period of 9 months starting in February 2001.
    For the Topic Detection Corpus, we only have the automatic orthographic transcriptions and the manual segmentation and indexation of the stories made by the RTP staff in charge of the daily program indexing. Each show was manually segmented into stories and each story was manually classified according to a thematic, geographic and onomastic (names of persons, companies and institutions) thesaurus; commercial breaks were annotated as non-news data (a sketch of one possible representation of these annotations appears after this list). The thesaurus is currently structured into 21 thematic areas, each of them hierarchically divided. The structure of this thesaurus follows rules generally adopted within the EBU (European Broadcast Union).
  • Textual Corpus (TC) Since the minimum amount defined for textual data (100 million words) was already available before the project started (see BD-Publico), there was no need to collect this type of data within the scope of the project. However, the newspaper texts that can be extracted daily from the internet constitute a very powerful resource for improving and keeping up to date the language models and pronunciation lexica. Hence, L2F has kept up this daily collection activity, which has now reached close to 450 million words.
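
The manual topic annotation described above for the TDC pairs each story with labels drawn from the thematic, geographic and onomastic branches of the thesaurus. The following Python sketch shows one possible in-memory representation of a segmented and indexed show; the field names and example labels are purely illustrative and do not correspond to actual thesaurus entries.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Story:
        """One manually segmented story from an evening-news show."""
        start: float                                          # start time (seconds)
        end: float                                            # end time (seconds)
        is_news: bool = True                                  # False for commercial breaks
        thematic: List[str] = field(default_factory=list)     # e.g. ["Politics/Government"]
        geographic: List[str] = field(default_factory=list)   # e.g. ["Portugal"]
        onomastic: List[str] = field(default_factory=list)    # persons, companies, institutions

    @dataclass
    class Show:
        """One broadcast of the 8 o'clock news, segmented into stories."""
        date: str
        stories: List[Story] = field(default_factory=list)

    # Illustrative usage with made-up labels:
    show = Show(date="2001-02-05", stories=[
        Story(start=0.0, end=95.3, thematic=["Politics/Government"],
              geographic=["Portugal"], onomastic=["Assembleia da Republica"]),
        Story(start=95.3, end=140.0, is_news=False),          # commercial break
    ])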

Prior to the collection of the SRC and TDC corpora, we collected a relatively small Pilot Corpus, which was used to discuss and set up the collection process and to decide on the most appropriate kinds of programs to collect. This corpus was recorded during one week in April 2000, amounting to 5.5 hours. For the pilot corpus, the audio was recorded at 44.1 kHz with 16 bits/sample; the final corpus was recorded at 32 kHz. Both were later downsampled to 16 kHz, as sketched below. This is the only corpus for which we also collected video data (MPEG-1). Manually corrected orthographic transcriptions and topic labels were added to this corpus.
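
Downsampling the 44.1 kHz pilot recordings and the 32 kHz final recordings to 16 kHz can be reproduced with standard resampling tools. The sketch below uses the soundfile and scipy packages as a convenient assumption (they are not the tools actually used in the project), and the file names are placeholders.

    from math import gcd

    import soundfile as sf
    from scipy.signal import resample_poly

    def downsample(in_path, out_path, target_sr=16000):
        """Resample a WAV file to target_sr using a polyphase filter."""
        audio, sr = sf.read(in_path)
        if sr != target_sr:
            # resample_poly changes the rate by the rational factor up/down,
            # e.g. 44100 -> 16000 uses up=160, down=441; 32000 -> 16000 uses up=1, down=2.
            g = gcd(target_sr, sr)
            audio = resample_poly(audio, target_sr // g, sr // g, axis=0)
        sf.write(out_path, audio, target_sr)

    downsample("pilot_program_44k.wav", "pilot_program_16k.wav")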