Resources: Difference between revisions

From HLT@INESC-ID

 
(7 intermediate revisions by 3 users not shown)
{{TOCright}}
L²F has been particularly active in the creation of linguistic resources for European Portuguese. The cooperation with CLUL has been of paramount importance in this activity. The resources listed are in inverse chronological order. The corresponding webpages are in Portuguese.  
We have been active in the creation of linguistic resources for European Portuguese and for other languages. The cooperation with CLUL has been of paramount importance regarding some of the available resources.  


== Corpora ==
== Speech Corpora ==


=== Speech ===
* [[POSTPORT Corpus|POSTPORT]] - European, Brazilian and African varieties of Portuguese
* [[LECTRA Corpus|LECTRA]] - Classroom lectures
* [[IPSOM Pilot Corpus|IPSOM]] - Aligned spoken books
* [[ALERT Corpus|ALERT]] - Broadcast news
* [[BDFALA Corpus|BDFALA]] - Speech analysis / synthesis
* [[BD-PÚBLICO Corpus|BD-PÚBLICO]] - Large vocabulary, speaker-independent, continuous speech
* [[CORAL Corpus|CORAL]] - Spoken dialogues (map task)
* [[BD-PÚBLICO Corpus|BD-PÚBLICO]] - Large vocabulary, speaker-independent, continuous speech
* [[EUROM.1 Corpus|EUROM.1]] - Multi-Lingual speech corpus for phonetic comparison
* [[IPSOM Pilot Corpus|IPSOM]] - Aligned spoken books
* [[LECTRA Corpus|LECTRA]] - Classroom lectures
* POSTPORT - European, Brazilian and African varieties of Portuguese
* [[SPEECHDAT Corpus|SPEECHDAT]] - Multi-purpose telephone speech database
* [[BDFALA Corpus|BDFALA]] - Speech analysis / synthesis
* '''[[VoxCeleb-PT]]''' - Annotated corpus of European Portuguese celebrities
* [[EUROM.1 Corpus|EUROM.1]] - Multi-Lingual speech corpus for phonetic comparison
 
=== Bilingual Corpus ===


* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments between six european languages taken from the Europarl common test set <br>(more information on the [[Speech-to-speech Translation]] information page)
== Text Corpora ==
* [[Word_Alignments|Europarl golden collection of parallel multi-language word alignments]] - Manually annotated word alignments between six European languages taken from the Europarl common test set (more information on the [[Speech-to-speech Translation]] information page)
* [https://www.hlt.inesc-id.pt/~u000775/LitRec-v1.tar.gz LitRec] - Book recommendation corpus


== Lexica ==
== Pronunciation Lexica ==
Pronunciation lexica (besides the ones included in the above corpora documentation):
The following pronunciation lexica use the SAMPA phonetic alphabet. See the [[SAMPA Table for European Portuguese|SAMPA table for European Portuguese]] and some comments about its design.
* '''ONOMASTICA''' (Proper names of 11 European languages, in cooperation with TLP - Telefones de Lisboa e Porto): ~ 100,000 names of people, streets, towns and companies
* '''PF''' (Português Fundamental): ~ 26,000 citation forms
The pronunciation lexica developed by L²F use the SAMPA phonetic alphabet. See the [[SAMPA Table for European Portuguese|SAMPA table for European Portuguese]] and some comments about its design.


== See Also ==

* [[Resource Links]]
 
<!--
=== Newspapers ===
* [http://www.ims.uni-stuttgart.de/info/Newspapers.html List of Newspapers on the Internet] produced by [[Isabel Trancoso]] and maintained jointly with IMS Stuttgart.
=== Language Resource Centers ===
* [http://www.linguateca.pt Linguateca] (Distributed language resource center for Portuguese)
* [http://www.icp.grenet.fr/ELRA/home.html ELRA] (European Language Resources Association)
* [http://morph.ldc.upenn.edu/ LDC] (Linguistic Data Consortium)


=== Dictionaries ===
* [http://www.fmi.uni-passau.de/htbin/lt/ltd German-English Dictionary]
* [http://nova.sti.nasa.gov/nasa-thesaurus.html NASA Thesaurus]
 
-->
 
[[category:Resources]]

Latest revision as of 11:11, 26 December 2023

We have been active in the creation of linguistic resources for European Portuguese and for other languages. The cooperation with CLUL has been of paramount importance regarding some of the available resources.

Speech Corpora

  • ALERT - Broadcast news
  • BDFALA - Speech analysis / synthesis
  • BD-PÚBLICO - Large vocabulary, speaker-independent, continuous speech
  • CORAL - Spoken dialogues (map task)
  • EUROM.1 - Multi-Lingual speech corpus for phonetic comparison
  • IPSOM - Aligned spoken books
  • LECTRA - Classroom lectures
  • POSTPORT - European, Brazilian and African varieties of Portuguese
  • SPEECHDAT - Multi-purpose telephone speech database
  • VoxCeleb-PT - Annotated corpus of European Portuguese celebrities

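Text Corpora

  • Europarl golden collection of parallel multi-language word alignments - Manually annotated word alignments between six European languages taken from the Europarl common test set (more information on the Speech-to-speech Translation information page)
  • LitRec - Book recommendation corpus
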
Pronunciation Lexica

The following pronunciation lexica use the SAMPA phonetic alphabet. See the SAMPA table for European Portuguese and some comments about its design.

  • ONOMASTICA (Proper names of 11 European languages, in cooperation with TLP - Telefones de Lisboa e Porto): ~ 100,000 names of people, streets, towns and companies
  • PF (Português Fundamental): ~ 26,000 citation forms

See Also