L²F Day 2006

From HLT@INESC-ID

Revision as of 11:43, 17 February 2006

Integrated Tools and Ontologies

Information available at http://l2f.l2f.inesc-id.pt/ (intranet)

  • Integrated tools:
    • ATA
    • JaVaLi!
    • DID
    • SAF
  • Third-party tools:
    • Intex
  • Ontologies:
    • OntoWine (wine domain ontology)
    • OntoChef (cooking domain ontology)

Lexicons

Information available at http://lrdb.l2f.inesc-id.pt/ (intranet)

  • PAROLE/SIMPLE: 20k root forms + inflection paradigms (morphology + syntax + semantics)
  • LUSOlex: 65k root forms (morphology + gramcat)
  • BRASILex: 68k root forms (morphology + gramcat)
  • Integration of LUSOlex + EPLexIC: ~8-10x EPLexIC phonetic forms
  • DicPro: 6.2k anthroponyms
  • SMorph: 26k root forms (morphology + inflection paradigm)
  • EPLexIC: 80k word forms (morphology + pronunciation); under construction
  • ONOMASTICA: 85k proper names (people, streets, cities, companies); 11 languages and cross-lingual information; pronunciation
  • Broadcast News: 64k entries (pronunciation)

Corpora

Information available at http://corpora.l2f.inesc-id.pt/ (intranet)

  • CETENFolha: 24 Mwords (newspaper corpus)
  • CETEMPúblico: 180 Mwords (newspaper corpus)
  • CHInf: 100 children's stories (books)
  • Newspapers: 10 daily newspapers; ~600 Mwords
  • PAROLE: ~20 Mwords

Coffee Break and Welcome Reception

  • 10:30 - Welcome reception by General Carlos Carvalho dos Reis.

Spoken Language Corpora

Information available at http://corpora.l2f.inesc-id.pt/ (intranet)
List of presented corpora:

  • EUROM.1
  • BDFALA: newspapers and TV debates
  • SPEECHDAT
  • CORAL
  • ALERT-ASR
  • ALERT-TD: TV broadcast news
  • IPSOM: six spoken books (read by professionals); discussion regarding publication and distribution rights
  • LECTRA: classroom lectures (pilot corpus); two semesters (under construction)
  • PAPOUS: corpus of children's stories performed by António Rito Silva (falsetto voice)

Simple Text Processing Tools

  • 11:20 - Presentation by Fernando Batista.

  • Morphological analysis
    • SMorph - POS tagger, tokenizer, generator
    • Palavroso - POS tagger, tokenizer, generator
    • Amorfo/XA - POS tagger, tokenizer, simultaneous multi-lingual analysis; error-correction (spelling correction)
  • Morphological generation
    • Monge - general form generator; language- and tag-independent; uses LRDB (under development; usable)
    • Gover - verb generator (~10k manually corrected verbs)
  • Morpho-syntax processing
    • PAsMo - rule-based rewriter
    • MARv - morpho-syntactic disambiguation
  • Syntactic analysis
    • SuSAna -
    • ParVO - syntactic analyzer (Earley algorithm; variable unification; O(n³))
  • Syntax-Semantics interface
    • Algas - arrowing construction
    • AsDeCopas -
  • Other tools
    • text2syl - syllabification
    • num2ext - text normalizer
    • YAH - (yet another) hyphenator (rule-based); MS Office compatible
    • Correcto - spell checker; MS Office compatible
    • leia -
  • General purpose
    • FSTK lib - finite-state transducer toolkit