<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.hlt.inesc-id.pt/wiki/index.php?action=history&amp;feed=atom&amp;title=BD-P%C3%9ABLICO_Corpus</id>
	<title>BD-PÚBLICO Corpus - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://www.hlt.inesc-id.pt/wiki/index.php?action=history&amp;feed=atom&amp;title=BD-P%C3%9ABLICO_Corpus"/>
	<link rel="alternate" type="text/html" href="https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;action=history"/>
	<updated>2026-05-31T23:29:50Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=2496&amp;oldid=prev</id>
		<title>David at 13:24, 3 June 2006</title>
		<link rel="alternate" type="text/html" href="https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=2496&amp;oldid=prev"/>
		<updated>2006-06-03T13:24:33Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:24, 3 June 2006&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;__NOTOC__&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{TOCright}}&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The BD-PUBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. This corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems.  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The BD-PUBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. This corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems.  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l28&quot;&gt;Line 28:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 28:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The corpus material amounts to more than 2 GB and was packed onto 4 CD-ROMs.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The corpus material amounts to more than 2 GB and was packed onto 4 CD-ROMs.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[category:Resources]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[category:Corpora]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=1352&amp;oldid=prev</id>
		<title>Root at 03:06, 13 February 2006</title>
		<link rel="alternate" type="text/html" href="https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=1352&amp;oldid=prev"/>
		<updated>2006-02-13T03:06:19Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 03:06, 13 February 2006&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;__NOTOC__&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The BD-PUBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. This corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems.  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The BD-PUBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. This corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems.  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Root</name></author>
	</entry>
	<entry>
		<id>https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=1351&amp;oldid=prev</id>
		<title>Root at 03:05, 13 February 2006</title>
		<link rel="alternate" type="text/html" href="https://www.hlt.inesc-id.pt/wiki/index.php?title=BD-P%C3%9ABLICO_Corpus&amp;diff=1351&amp;oldid=prev"/>
		<updated>2006-02-13T03:05:40Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;The BD-PUBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. This corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems. &lt;br /&gt;
&lt;br /&gt;
== Linguistic Contents ==&lt;br /&gt;
&lt;br /&gt;
The text material for the read sentences was extracted from the Portuguese newspaper PÚBLICO, consisting of 6 months of news, totalling 10M words and 156k different forms.&lt;br /&gt;
 &lt;br /&gt;
The corpus is based on 3 sets: &lt;br /&gt;
*Training set: 80 sentences plus 3 calibration sentences for each speaker. &lt;br /&gt;
*Development set: 40 sentences plus 15 speaker-adaptation sentences per speaker. &lt;br /&gt;
*Evaluation set: 40 sentences plus 15 speaker-adaptation sentences and 3 calibration sentences for each speaker. &lt;br /&gt;
Two vocabulary sizes are covered: 5K and 20K (the latter in a later recording phase). &lt;br /&gt;
&lt;br /&gt;
== Number and Type of Speakers ==&lt;br /&gt;
&lt;br /&gt;
Speakers were selected from among undergraduate and graduate students at IST. Ages ranged from 19 to 28, and a broad coverage of accents was obtained. &lt;br /&gt;
&lt;br /&gt;
A total of 120 speakers were recorded: 100 for the training set (50 male and 50 female) and 20 (10 male and 10 female) divided equally between the 5K-word evaluation and development sets. Each recording session resulted in approximately 15 minutes of speech. &lt;br /&gt;
&lt;br /&gt;
== Data Collection ==&lt;br /&gt;
&lt;br /&gt;
The recordings were made in a soundproof room at INESC (Lisbon) using a high-quality microphone, directly to disk at a 16 kHz sampling frequency. &lt;br /&gt;
&lt;br /&gt;
== Annotation ==&lt;br /&gt;
&lt;br /&gt;
A pronunciation lexicon with citation phonemic transcriptions for each word was produced by hand-correcting the automatically generated transcriptions. &lt;br /&gt;
&lt;br /&gt;
== Packaging ==&lt;br /&gt;
&lt;br /&gt;
The corpus material amounts to more than 2 GB and was packed onto 4 CD-ROMs.&lt;/div&gt;</summary>
		<author><name>Root</name></author>
	</entry>
</feed>