HLT@INESC-ID - User contributions [en]

File:Tipo-passe-javg.png

2012-03-29T09:40:29Z

Javg: uploaded a new version of "File:Tipo-passe-javg.png"

João Graça

2012-02-13T21:42:21Z

Javg:

{{infobox|name=João Graça
|username=javg
|contact=javg
|phone=+351-213-100-351
|fax=+351-213-145-843
}}

João Graça is currently a research scientist at FlashGroup and a researcher at INESC-ID. Previously he was a postdoctoral researcher at the University of Pennsylvania, working under the supervision of Prof. Ben Taskar. He obtained his PhD in Computer Science Engineering at Instituto Superior Técnico, Technical University of Lisbon, where he was advised jointly by Luísa Coheur, Fernando Pereira and Ben Taskar. His main research interests are Machine Learning and Natural Language Processing. His current research focuses on unsupervised learning with high-level supervision in the form of domain-specific prior knowledge, and on the utility of unsupervised methods for real-world applications.

== Research Interests ==

* Machine Learning
* Natural Language Processing
* Statistical Machine Translation
* Software Engineering applied to NLP

== Ongoing Projects ==
* Toolkit for word alignments
** Constrained Alignment Toolkit [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html]
* Gold Standard of word alignments
** [[Word_Alignments|Golden collection of parallel multi-language word alignments]]

== Finished Projects ==
* A computational framework for a Natural Language course

<inesc-id what='person' id='542'></inesc-id>

[[category:People]]
[[category:Graduate Students]]

João Graça

2012-02-13T21:39:21Z

Javg:

{{infobox|name=João Graça
|username=javg
|contact=javg
|phone=+351-213-100-351
|fax=+351-213-145-843
}}

João Graça is currently a research scientist at FlashGroup and a researcher at INESC-ID. Previously he was a postdoctoral researcher at the University of Pennsylvania, working under the supervision of Prof. Ben Taskar. He obtained his PhD in Computer Science Engineering at Instituto Superior Técnico, Technical University of Lisbon, where he was advised jointly by Luísa Coheur, Fernando Pereira and Ben Taskar. His main research interests are Machine Learning and Natural Language Processing. His current research focuses on unsupervised learning with high-level supervision in the form of domain-specific prior knowledge, and on the utility of unsupervised methods for real-world applications.

== Research Interests ==

* Statistical Machine Translation
* Word Alignment
* Machine Learning
* Natural Language Processing
* Software Engineering applied to NLP

== Ongoing Projects ==
* Toolkit for word alignments
** Constrained Alignment Toolkit [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html]
* Gold Standard of word alignments
** [[Word_Alignments|Golden collection of parallel multi-language word alignments]]

== Finished Projects ==
* A computational framework for a Natural Language course

<inesc-id what='person' id='542'></inesc-id>

[[category:People]]
[[category:Graduate Students]]

People

2011-07-28T13:50:35Z

Javg:

__NOTOC__

== Researchers ==

{| width="100%"
! style="font-weight: normal; text-align: left; width: 25%;" | [[Isabel Trancoso|Isabel M. Trancoso]]
! style="font-weight: normal; text-align: left; width: 25%;" | [[Hugo Meinedo]]
| style="font-weight: normal; text-align: left; width: 25%;" | [[Luísa Coheur]]
| style="font-weight: normal; text-align: left; width: 25%;" | [[Nuno Mamede]]
|-
| [[David Martins de Matos]]
| [[João Paulo Neto]]
| [[Luís Caldas de Oliveira|Luís C. Oliveira]]
| [[António Serralheiro|António J. Serralheiro]]
|-
| [[Ramón Astudillo]]
| [[Alberto Abad Gareta]]
| [[Thomas Pellegrini]]
| [[Joao Paulo Carvalho]]
|-
| [[Miguel Bugalho]]
| [[Anabela Barreiro]]
| [[João Graça]]
|}

== PhD Students ==

{| width="100%"
! style="font-weight: normal; text-align: left; width: 25%;" | [[Helena Moniz]]
! style="font-weight: normal; text-align: left; width: 25%;" | [[Fernando Batista]]
! style="font-weight: normal; text-align: left; width: 25%;" | [[Isabel Mascarenhas]]
! style="font-weight: normal; text-align: left; width: 25%;" | [[João Miranda]]
|-
| [[Gopala Krishna Anumanchipalli]]
| [[Joana Paulo Pardal]]
| [[Ricardo Daniel Ribeiro|Ricardo Ribeiro]]
| [[Paula Cristina Vaz|Paula Cristina Vaz]]
|-
| [[Filipe Cabecinhas]]
| [[José David Lopes]]
| [[Ana Cristina Mendes]]
| [[Luís Garcia]]
|-
| [[Tiago Luís]]
| [[Luís Marujo]]
| [[Wang Ling]]
| [[Rui Correia]]
|-
| [[Gracinda Carvalho]]
| [[José Portêlo]]
|}

== Research Associates ==

{| width="100%"
! width="25%" style="text-align: left; font-weight: normal;" | [[Diogo Oliveira]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Vera Cabarrão]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Ângela Costa]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Sérgio Curto]]
|
|-
| [[Cláudio Diniz]]
|}

== Associated Researchers ==

{| width="100%"
! width="25%" style="text-align: left; font-weight: normal;" | [[Maria do Céu Viana]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Jorge Baptista]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Gaël Harry Dias]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Diamantino Caseiro|Diamantino A. Caseiro]]
|}

== Masters Students ==

{| width="100%"
! width="25%" style="text-align: left; font-weight: normal;" | [[Bruno Almeida]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Sérgio Gomes]]
! width="25%" style="text-align: left; font-weight: normal;" | [[David Antunes]]
! width="25%" style="text-align: left; font-weight: normal;" | [[Ricardo Pires]]
|-
| [[Sérgio Morais]]
| [[Andreia Maurício]]
| [[Pedro Girão Antunes]]
| [[Zoran Vitez]]
|-
| [[André Gonçalves]]
| [[Ricardo Portela]]
| [[Nuno Nobre]]
| [[João Completo]]
|-
| [[Eugénio Ribeiro]]
|
|}

== Trainees ==

{| width="100%"
! width="25%" style="text-align: left; font-weight: normal;" | [[Jaime Ferreira]]
! width="25%" style="text-align: left; font-weight: normal;" | [[João Colaço]]
! width="25%" style="text-align: left; font-weight: normal;" | [[vahid keshavarz hedayati|Vahid Keshavarz Hedayati]]
! width="25%" style="text-align: left; font-weight: normal;" |
|-
|
|}

== Administrative Support ==

{| width="100%"
! style="font-weight: normal; text-align: left; width: 33%;" | [[Teresa Mimoso]]
! style="font-weight: normal; text-align: left; width: 33%;" |
|
|-
|
|
|}

== Former L²F Members ==

* List of [[Former L²F Members]]

João Graça

2009-11-09T17:06:54Z

Javg:

{{infobox|name=João Graça
|username=javg
|contact=javg
|phone=+351-213-100-351
|fax=+351-213-145-843
}}

João Graça graduated in Informatics and Computer Science Engineering in 2002 from Instituto Superior Técnico (IST), Lisbon. He received a Master's degree in Informatics and Computer Science Engineering in 2006, also from IST, in the area of artificial intelligence, with the development of a framework for integrating natural language tools. He was a teaching assistant at IST from 2003 to 2006, where he taught object-oriented programming and compilers. He is also a researcher at the Spoken Language Systems Lab (L2F).

He is currently in the fourth year of his PhD program, carried out in collaboration with the University of Pennsylvania, where he has spent the last few years as an invited student. His PhD focuses on machine learning and natural language processing. He is currently working on new methods for unsupervised learning with domain-specific constraints.

== Research Interests ==

* Statistical Machine Translation
* Word Alignment
* Machine Learning
* Natural Language Processing
* Software Engineering applied to NLP

== Ongoing Projects ==
* Toolkit for word alignments
** Constrained Alignment Toolkit [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html]
* Gold Standard of word alignments
** [[Word_Alignments|Golden collection of parallel multi-language word alignments]]

== Finished Projects ==
* A computational framework for a Natural Language course

<inesc-id what='person' id='542'></inesc-id>

[[category:People]]
[[category:Graduate Students]]

File:Joaograca.jpg

2009-11-09T17:01:47Z

Javg:

Downloads

2008-07-17T18:00:49Z

Javg: /* Translation */

{{TOCright}}
These are tools and resources made available by L²F.

== Tools ==

* [http://www.l2f.inesc-id.pt/~lco/eugenio/index.html Eugenio] - Word Predictor for European Portuguese

== Resources ==

=== Translation ===

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs, taken from the Europarl common test set <br>(more information on the [[Speech-to-speech Translation]] page)

=== Other ===

* [http://www.l2f.inesc-id.pt/resources/Portug.Dict.sit Portuguese Dictionary] for [http://www.eg.bucknell.edu/~excalibr/excalibur.html Excalibur], a spell checker for the Macintosh.<br/>Assembled by [[Nuno Mamede]] in association with [http://label2.ist.utl.pt/label/ LabEL]

Speech-to-speech Translation

2008-07-17T17:59:50Z

Javg:

{| style="margin-left: 10px; margin-bottom: 10px; width: 120px; font-size: 95%; border-width: 1px; border-style: solid; background: #dee2ff;" cellpadding="4" align="right"
|+ style="font-size: larger;" | '''People'''
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-javg.png|100px|center|]][[João Graça]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-lcoheur.png|100px|center|]][[Luísa Coheur]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-dcaseiro.png|100px|center|]][[Diamantino Caseiro]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-joana.png|100px|center|]][[Joana Paulo Pardal]]</div>
|}
__NOTOC__

Speech-to-speech machine translation is one of the most strategically relevant areas for L2F. The state of the art in speech translation depends crucially on the state of the art of several core technologies: speech recognition, machine translation and text-to-speech synthesis (in particular voice morphing, which is needed to reproduce the source speaker’s characteristics in the target speaker’s voice). The main limitations of current machine translation systems are the lack of semantic interpretation and world knowledge, as well as insufficient coverage of the large proportion of idiosyncratic linguistic phenomena in lexicon and syntax. The most promising approaches combine improved statistical methods with improved knowledge-driven methods in a variety of ways.

The research at L2F started by investing in statistical speech-to-speech machine translation approaches based on weighted finite state transducers (WFSTs) [Picó 2005] [Caseiro 2006], aiming at a tight integration between recognition and translation. WFSTs are especially well suited for combining different types of approaches, whether statistical or knowledge-based. The combination may help achieve two goals: (i) including morpho-syntactic linguistic knowledge in the statistical machine translation paradigm and (ii) tackling the data sparseness problem in speech translation. The work was carried out within the scope of a national project on “Weighted Finite State Transducers Applied to Spoken Language Processing”.

In 2007, L2F participated in the 4th International Workshop on Spoken Language Translation [Graça 2007], using a standard combination of phrase-based machine translation and translation reranking. The reranking step used some new features based on linguistic information, which showed promising results.

The current focus of research is text-based statistical machine translation, namely word alignments, since these are an important starting point for most state-of-the-art statistical machine translation systems. To that end, a new algorithm that achieves state-of-the-art results was developed in cooperation with the University of Pennsylvania [Graça 2007, Ganchev 2008]. In addition, a guideline for building manual alignments between different language pairs was proposed, along with gold alignments for six European language pairs [Graça 2008]. These gold alignments can be a valuable resource both for evaluating and for tuning word alignment models.
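As an illustration of how gold alignments can be used for evaluation, the sketch below computes the Alignment Error Rate (AER), a commonly used word alignment metric. This page does not state which metric is used with the collection, so the choice of AER, the split into "sure" and "possible" links, the function name and the toy data are all assumptions made for the example.

```python
def alignment_error_rate(predicted, sure, possible):
    """Return AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure and possible are sets of (source_index, target_index)
    links. Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure  # enforce S as a subset of P
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Hypothetical toy sentence pair (not taken from the gold collection).
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (1, 1), (2, 2)}

aer = alignment_error_rate(predicted, sure, possible)
print(f"AER = {aer:.2f}")
```

A lower value means the predicted links agree more closely with the manual annotation. A model that predicts exactly the sure links gets an AER of 0.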

In September 2008, the Machine Translation team will be joined by four Master's students.

== Related Resources ==

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs, taken from the Europarl common test set

== Related Software ==

* Constrained Alignment Toolkit (CAT) - Word Alignment Toolkit produced in cooperation with the University of Pennsylvania. Please see [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html official web site]

== Demos ==

A demonstration of tightly integrated [http://www.l2f.inesc-id.pt/projects/wfst/demo/trans-pc.smi speech-to-text] translation is available.
The translation module is implemented as a single WFST that is used as the language model in the speech recognizer. This architecture produces sentences in the target language directly from source language speech.

A demonstration of [http://www.l2f.inesc-id.pt/projects/wfst/demo-bnews/2006_10_19-19_59_01_en.smi large vocabulary translation] is also available.
The output of the WFST-based speech recognition module was translated using
a WFST-based machine translation module trained in the European Parliament domain.

Recent demos:

[http://www.l2f.inesc-id.pt/projects/wfst/DEMO_pt_es_en/demo_pt_es_en.smi Broadcast News translation from Portuguese to Spanish and English]

[http://www.l2f.inesc-id.pt/projects/wfst/ES_DEMO/chile/chile.smi Broadcast News translation from South American Spanish to Portuguese]

== Finished Projects ==

* [[WFST (Weighted Finite State Transducers Applied to Spoken Language Processing)|WFST]] - Weighted Finite State Transducers Applied to Spoken Language Processing (2004-2007)

== Selected Publications ==

* Kuzman Ganchev, João de Almeida Varelas Graça, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4813.pdf Better Alignments = Better Translations?], In ACL-08: HLT, Association for Computational Linguistics, pages 986-993, June 2008

* João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

* João de Almeida Varelas Graça, Kuzman Ganchev, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4306.pdf Expectation Maximization and Posterior Constraints], In Neural Information Processing Systems Conference (NIPS), December 2007

* João de Almeida Varelas Graça, Diamantino António Caseiro, Luísa Coheur, [http://www.mt-archive.info/IWSLT-2007-Graca.pdf The INESC-ID IWSLT07 SMT System], In Proceedings of IWSLT International Workshop on Spoken Language Translation, pages 125-130, October 2007 ([http://iwslt07.itc.it/menu/presentations/sysSession4/INESC-ID.pdf slides pdf])

* Diamantino António Caseiro, Isabel Trancoso, Weighted Finite-State Transducer Inference for Limited-Domain Speech-to-Speech Translation, In Computational Processing of the Portuguese Language: 7th International Workshop, PROPOR 2006, Springer, pages 60-68, May 2006

* D. Picó, J. González, F. Casacuberta, Diamantino António Caseiro, Isabel Trancoso, Finite-state transducer inference for a speech-input Portuguese-to-English machine translation system, In Interspeech 2005, September 2005

Speech-to-speech Translation

2008-07-16T16:12:40Z

Javg:

{| style="margin-left: 10px; margin-bottom: 10px; width: 120px; font-size: 95%; border-width: 1px; border-style: solid; background: #dee2ff;" cellpadding="4" align="right"
|+ style="font-size: larger;" | '''People'''
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-javg.png|100px|center|]][[João Graça]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-lcoheur.png|100px|center|]][[Luísa Coheur]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-dcaseiro.png|100px|center|]][[Diamantino Caseiro]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-joana.png|100px|center|]][[Joana Paulo Pardal]]</div>
|}
__NOTOC__

Speech-to-speech machine translation is one of the most strategically relevant areas for L2F. The state of the art in speech translation depends crucially on the state of the art of several core technologies: speech recognition, machine translation and, to a lesser extent, text-to-speech synthesis (in particular voice morphing, which is needed to reproduce the source speaker’s characteristics in the target speaker’s voice). The main limitations of current machine translation systems are the lack of semantic interpretation and world knowledge, as well as insufficient coverage of the large proportion of idiosyncratic linguistic phenomena in lexicon and syntax. The most promising approaches combine improved statistical methods with improved knowledge-driven methods in a variety of ways.

The research at L2F started by investing in statistical speech-to-speech machine translation approaches based on weighted finite state transducers (WFSTs) [Picó 2005] [Caseiro 2006], aiming at a tight integration between recognition and translation. WFSTs are especially well suited for combining different types of approaches, whether statistical or knowledge-based. The combination may help achieve two goals: (i) including morpho-syntactic linguistic knowledge in the statistical machine translation paradigm and (ii) tackling the data sparseness problem in speech translation. The work was carried out within the scope of a national project on “Weighted Finite State Transducers Applied to Spoken Language Processing”.

In 2007, L2F participated in the 4th International Workshop on Spoken Language Translation [Graça 2007], using a standard combination of phrase-based machine translation and translation reranking. The reranking step used some new features based on linguistic information, which showed promising results.

The current focus of research is text-based statistical machine translation, namely word alignments, since these are an important starting point for most state-of-the-art statistical machine translation systems. To that end, a new algorithm that achieves state-of-the-art results was developed in cooperation with the University of Pennsylvania [Graça 2007, Ganchev 2008]. In addition, a guideline for building manual alignments between different language pairs was proposed, along with gold alignments for six European language pairs [Graça 2008]. These gold alignments can be a valuable resource both for evaluating and for tuning word alignment models.

Currently the Machine Translation team consists of a senior researcher and a PhD student. In September 2008, the team will be joined by four Master's students.

== Related Resources ==

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs, taken from the Europarl common test set

== Related Software ==

* Constrained Alignment Toolkit (CAT) - Word Alignment Toolkit produced in cooperation with the University of Pennsylvania. Please see [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html official web site]

== Demos ==

* See [[WFST (Weighted Finite State Transducers Applied to Spoken Language Processing)|WFST]] [http://www.l2f.inesc-id.pt/wiki/index.php/WFST_-_Weighted_Finite_State_Transducers_Applied_to_Spoken_Language_Processing#Demos demos page]

== Finished Projects ==

* [[WFST (Weighted Finite State Transducers Applied to Spoken Language Processing)|WFST]] - Weighted Finite State Transducers Applied to Spoken Language Processing (2004-2007)

== Selected Publications ==

* Kuzman Ganchev, João de Almeida Varelas Graça, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4813.pdf Better Alignments = Better Translations?], In ACL-08: HLT, Association for Computational Linguistics, pages 986-993, June 2008

* João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

* João de Almeida Varelas Graça, Kuzman Ganchev, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4306.pdf Expectation Maximization and Posterior Constraints], In Neural Information Processing Systems Conference (NIPS), December 2007

* João de Almeida Varelas Graça, Diamantino António Caseiro, Luísa Coheur, [http://www.mt-archive.info/IWSLT-2007-Graca.pdf The INESC-ID IWSLT07 SMT System], In Proceedings of IWSLT International Workshop on Spoken Language Translation, pages 125-130, October 2007 ([http://iwslt07.itc.it/menu/presentations/sysSession4/INESC-ID.pdf slides pdf])

* Diamantino António Caseiro, Isabel Trancoso, Weighted Finite-State Transducer Inference for Limited-Domain Speech-to-Speech Translation, In Computational Processing of the Portuguese Language: 7th International Workshop, PROPOR 2006, Springer, pages 60-68, May 2006

* D. Picó, J. González, F. Casacuberta, Diamantino António Caseiro, Isabel Trancoso, Finite-state transducer inference for a speech-input Portuguese-to-English machine translation system, In Interspeech 2005, September 2005

João Graça

2008-07-16T15:23:02Z

Javg:

{{infobox|name=João Graça
|username=javg
|contact=javg
|phone=+351-213-100-351
|fax=+351-213-145-843
}}

João Graça graduated in Informatics and Computer Science Engineering in 2002 from Instituto Superior Técnico (IST), Lisbon. He received a Master's degree in Informatics and Computer Science Engineering in 2006, also from IST, in the area of artificial intelligence, with the development of a framework for integrating natural language tools. He was a teaching assistant at IST from 2003 to 2006, where he taught object-oriented programming and compilers. He is also a researcher at the Spoken Language Systems Lab (L2F).

He is currently in the second year of his PhD program, carried out in collaboration with the University of Pennsylvania, where he has spent the last two years as an invited student. His PhD focuses on statistical machine translation. He is currently working on new unsupervised learning methods for automatic word alignment and on their impact on the output of end-to-end machine translation systems.

== Research Interests ==

* Statistical Machine Translation
* Word Alignment
* Machine Learning
* Software Engineering applied to NLP

== Ongoing Projects ==
* Toolkit for word alignments
** Constrained Alignment Toolkit [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html]
* Gold Standard of word alignments
** [[Word_Alignments|Golden collection of parallel multi-language word alignments]]

== Finished Projects ==
* A computational framework for a Natural Language course

<inesc-id what='person' id='542'></inesc-id>

[[category:People]]
[[category:Graduate Students]]

Speech-to-speech Translation

2008-06-30T15:59:32Z

Javg:

{| style="margin-left: 10px; margin-bottom: 10px; width: 120px; font-size: 95%; border-width: 1px; border-style: solid; background: #dee2ff;" cellpadding="4" align="right"
|+ style="font-size: larger;" | '''People'''
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-javg.png|100px|center|]][[João Graça]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-lcoheur.png|100px|center|]][[Luísa Coheur]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-dcaseiro.png|100px|center|]][[Diamantino Caseiro]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-joana.png|100px|center|]][[Joana Paulo Pardal]]</div>
|}

Speech-to-speech machine translation is one of the most strategically relevant areas for L2F.
The state of the art in speech translation depends crucially on the state of the art of several core technologies: speech recognition, machine translation and, to a lesser extent, text-to-speech synthesis (in particular voice morphing, which is needed to reproduce the source speaker’s characteristics in the target speaker’s voice). The main limitations of current machine translation systems are the lack of semantic interpretation and world knowledge, as well as insufficient coverage of the large proportion of idiosyncratic linguistic phenomena in lexicon and syntax. The most promising approaches combine improved statistical methods with improved knowledge-driven methods in a variety of ways.

L2F has been investing in statistical speech-to-speech machine translation approaches based on weighted finite state transducers (WFSTs) [Picó 2005] [Caseiro 2006], aiming at a tight integration between recognition and translation. WFSTs are especially well suited for combining different types of approaches, whether statistical or knowledge-based. The combination may help achieve two goals: (i) including morpho-syntactic linguistic knowledge in the statistical machine translation paradigm and (ii) tackling the data sparseness problem in speech translation.

This research is carried out within the scope of a national project on “Weighted Finite State Transducers Applied to Spoken Language Processing”. Two PhD theses have recently started in this area.

== Related Resources ==

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs, taken from the Europarl common test set

== Related Software ==

* Constrained Alignment Toolkit (CAT) - Word Alignment Toolkit produced in cooperation with the University of Pennsylvania. Please see the [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html official web site]

== Demos ==

* See the [http://www.l2f.inesc-id.pt/wiki/index.php/WFST_-_Weighted_Finite_State_Transducers_Applied_to_Spoken_Language_Processing#Demos demos page]

== Finished Projects ==

* [[WFST (Weighted Finite State Transducers Applied to Spoken Language Processing)|WFST]] - Weighted Finite State Transducers Applied to Spoken Language Processing (2004-2007)

== Selected Publications ==

* Kuzman Ganchev, João de Almeida Varelas Graça, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4813.pdf Better Alignments = Better Translations?], In ACL-08: HLT, Association for Computational Linguistics, pages 986-993, June 2008

* João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

* João de Almeida Varelas Graça, Kuzman Ganchev, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4306.pdf Expectation Maximization and Posterior Constraints], In Neural Information Processing Systems Conference (NIPS), December 2007

* João de Almeida Varelas Graça, Diamantino António Caseiro, Luísa Coheur, [http://www.mt-archive.info/IWSLT-2007-Graca.pdf The INESC-ID IWSLT07 SMT System], In Proceedings of IWSLT International Workshop on Spoken Language Translation, pages 125-130, October 2007 ([http://iwslt07.itc.it/menu/presentations/sysSession4/INESC-ID.pdf slides pdf])

* Diamantino António Caseiro, Isabel Trancoso, Weighted Finite-State Transducer Inference for Limited-Domain Speech-to-Speech Translation, In Computational Processing of the Portuguese Language: 7th International Workshop, PROPOR 2006, Springer, pages 60-68, May 2006

* D. Picó, J. González, F. Casacuberta, Diamantino António Caseiro, Isabel Trancoso, Finite-state transducer inference for a speech-input Portuguese-to-English machine translation system, In Interspeech 2005, September 2005

Speech-to-speech Translation

2008-06-30T15:56:34Z

Javg: /* Related Software */

{| style="margin-left: 10px; margin-bottom: 10px; width: 120px; font-size: 95%; border-width: 1px; border-style: solid; background: #dee2ff;" cellpadding="4" align="right"
|+ style="font-size: larger;" | '''People'''
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-javg.png|100px|center|]][[João Graça]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-lcoheur.png|100px|center|]][[Luísa Coheur]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-dcaseiro.png|100px|center|]][[Diamantino Caseiro]]</div>
|-
| align="center" |<div style="border-style: solid; border-width: 0px; width: 100px;">[[Image:tipo-passe-joana.png|100px|center|]][[Joana Paulo Pardal]]</div>
|}

Speech-to-speech machine translation is one of the most strategically relevant areas for L2F.
The state of the art in speech translation depends crucially on the state of the art of several core technologies: speech recognition, machine translation and, to a lesser extent, text-to-speech synthesis (in particular voice morphing, which is needed to reproduce the source speaker’s characteristics in the target speaker’s voice). The main limitations of current machine translation systems are the lack of semantic interpretation and world knowledge, as well as insufficient coverage of the large proportion of idiosyncratic linguistic phenomena in lexicon and syntax. The most promising approaches combine improved statistical methods with improved knowledge-driven methods in a variety of ways.

L2F has been investing in statistical speech-to-speech machine translation approaches based on weighted finite state transducers (WFSTs) [Picó 2005] [Caseiro 2006], aiming at a tight integration between recognition and translation. WFSTs are especially well suited for combining different types of approaches, whether statistical or knowledge-based. The combination may help achieve two goals: (i) including morpho-syntactic linguistic knowledge in the statistical machine translation paradigm and (ii) tackling the data sparseness problem in speech translation.

This research is carried out within the scope of a national project on “Weighted Finite State Transducers Applied to Spoken Language Processing”. Two PhD theses have recently started in this area.

== Related Resources ==

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs, taken from the Europarl common test set

== Related Software ==

* Constrained Alignment Toolkit (CAT) - Word Alignment Toolkit produced in cooperation with the University of Pennsylvania. Please see the [http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html official web site]

== Finished Projects ==

* [[WFST (Weighted Finite State Transducers Applied to Spoken Language Processing)|WFST]] - Weighted Finite State Transducers Applied to Spoken Language Processing (2004-2007)

== Selected Publications ==

* Kuzman Ganchev, João de Almeida Varelas Graça, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4813.pdf Better Alignments = Better Translations?], In ACL-08: HLT, Association for Computational Linguistics, pages 986-993, June 2008

* João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

* João de Almeida Varelas Graça, Kuzman Ganchev, Ben Taskar, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4306.pdf Expectation Maximization and Posterior Constraints], In Neural Information Processing Systems Conference (NIPS), December 2007

* João de Almeida Varelas Graça, Diamantino António Caseiro, Luísa Coheur, [http://www.mt-archive.info/IWSLT-2007-Graca.pdf The INESC-ID IWSLT07 SMT System], In Proceedings of IWSLT International Workshop on Spoken Language Translation, pages 125-130, October 2007 ([http://iwslt07.itc.it/menu/presentations/sysSession4/INESC-ID.pdf slides pdf])

* Diamantino António Caseiro, Isabel Trancoso, Weighted Finite-State Transducer Inference for Limited-Domain Speech-to-Speech Translation, In Computational Processing of the Portuguese Language: 7th International Workshop, PROPOR 2006, Springer, pages 60-68, May 2006

* D. Picó, J. González, F. Casacuberta, Diamantino António Caseiro, Isabel Trancoso, Finite-state transducer inference for a speech-input Portuguese-to-English machine translation system, In Interspeech 2005, September 2005

Speech-to-speech Translation

2008-06-30T15:54:18Z

Javg:

Speech-to-speech Translation

2008-06-30T15:51:23Z

Javg:

Word Alignments

2008-06-30T15:45:12Z

Javg:

[[Image:gold-alignment.png|200px|left|]]

Manually annotated word alignments for six different language pairs.
* Portuguese - English
* Portuguese - French
* Portuguese - Spanish
* English - Spanish
* English - French
* French - Spanish
If you use the corpus, please cite the following paper:
: João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

== Contents ==
The corpus is taken from the publicly available [http://www.statmt.org/europarl/ Europarl Corpus], which contains the proceedings of the European Parliament in the different official languages.
The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the [http://www.statmt.org/europarl/archives.html Europarl archives]. It is already tokenized and lowercased.

== Guidelines ==
The guidelines followed to produce the manual word alignments over the six language pairs (all combinations of Portuguese, English, French and Spanish) are available as a [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4734.pdf PDF].

== Download ==
'''[http://www.l2f.inesc-id.pt/resources/translation/golden_collection.zip Golden collection of parallel multi-language word alignments]'''

(More information is available on the [[Speech-to-speech Translation]] page.)

Word Alignments

2008-06-30T15:41:54Z

Javg:

[[Image:gold-alignment.png|100px|center|]]

Manually annotated word alignments for six different language pairs.
* Portuguese - English
* Portuguese - French
* Portuguese - Spanish
* English - Spanish
* English - French
* French - Spanish
If you use the corpus, please cite the following paper:
: João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

== Contents ==
The corpus is taken from the publicly available [http://www.statmt.org/europarl/ Europarl Corpus], which contains the proceedings of the European Parliament in the different official languages.
The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the [http://www.statmt.org/europarl/archives.html Europarl archives]. It is already tokenized and lowercased.

== Guidelines ==
The guidelines followed to produce the manual word alignments over the six language pairs (all combinations of Portuguese, English, French and Spanish) are available as a [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4734.pdf PDF].

== Download ==
'''[http://www.l2f.inesc-id.pt/resources/translation/golden_collection.zip Golden collection of parallel multi-language word alignments]'''

(More information is available on the [[Speech-to-speech Translation]] page.)

File:Gold-alignment.png

2008-06-30T15:39:45Z

Javg: An example of a manual word alignment

An example of a manual word alignment

Resources

2008-06-25T18:27:17Z

Javg: /* Translation */

{{TOCright}}
L²F has been particularly active in the creation of linguistic resources for European Portuguese. The cooperation with CLUL has been of paramount importance in this activity. The resources are listed in reverse chronological order. The corresponding webpages are in Portuguese.

== Corpora ==

=== Speech ===

* [[LECTRA Corpus|LECTRA]] - Classroom lectures
* [[IPSOM Pilot Corpus|IPSOM]] - Aligned spoken books
* [[ALERT Corpus|ALERT]] - Broadcast news
* [[CORAL Corpus|CORAL]] - Spoken dialogues (map task)
* [[BD-PÚBLICO Corpus|BD-PÚBLICO]] - Large vocabulary, speaker-independent, continuous speech
* [[SPEECHDAT Corpus|SPEECHDAT]] - Multi-purpose telephone speech database
* [[BDFALA Corpus|BDFALA]] - Speech analysis / synthesis
* [[EUROM.1 Corpus|EUROM.1]] - Multi-Lingual speech corpus for phonetic comparison

=== Bilingual Corpus ===

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs taken from the Europarl common test set

== Lexica ==
Pronunciation lexica (besides the ones included in the above corpora documentation):
* '''ONOMASTICA''' (Proper names in 11 European languages, in cooperation with TLP - Telefones de Lisboa e Porto): ~100,000 names of people, streets, towns and companies
* '''PF''' (Português Fundamental): ~26,000 citation forms

The pronunciation lexica developed by L²F use the SAMPA phonetic alphabet. See the [[SAMPA Table for European Portuguese|SAMPA table for European Portuguese]] and some comments about its design.

== See Also ==

* [[Resource Links]]

=== Newspapers ===
* [http://www.ims.uni-stuttgart.de/info/Newspapers.html List of Newspapers on the Internet] produced by [[Isabel Trancoso]] and maintained jointly with IMS Stuttgart.

=== Language Resource Centers ===
* [http://www.linguateca.pt Linguateca] (Distributed language resource center for Portuguese)
* [http://www.icp.grenet.fr/ELRA/home.html ELRA] (European Language Resources Association)
* [http://morph.ldc.upenn.edu/ LDC] (Linguistic Data Consortium)

=== Dictionaries ===
* [http://crnvmc.cern.ch/FIND/DICTIONARY? English/Technical Dictionary]
* [gopher://uts.mcc.ac.uk/77/gopherservices/enquire.english American English Dictionary]
* [gopher://gopher.princeton.edu:5003/7 Webster's Dictionary]
* [gopher://info.mcc.ac.uk/77/miscellany/acronyms/.index/index Acronyms Dictionary]
* [http://www.fmi.uni-passau.de/htbin/lt/lte English-German Dictionary]
* [http://www.fmi.uni-passau.de/htbin/lt/ltd German-English Dictionary]
* [http://nova.sti.nasa.gov/nasa-thesaurus.html NASA Thesaurus]

[[category:Resources]]

Resources

2008-06-25T18:26:20Z

Javg:

{{TOCright}}
L²F has been particularly active in the creation of linguistic resources for European Portuguese. The cooperation with CLUL has been of paramount importance in this activity. The resources are listed in reverse chronological order. The corresponding webpages are in Portuguese.

== Corpora ==

=== Speech ===

* [[LECTRA Corpus|LECTRA]] - Classroom lectures
* [[IPSOM Pilot Corpus|IPSOM]] - Aligned spoken books
* [[ALERT Corpus|ALERT]] - Broadcast news
* [[CORAL Corpus|CORAL]] - Spoken dialogues (map task)
* [[BD-PÚBLICO Corpus|BD-PÚBLICO]] - Large vocabulary, speaker-independent, continuous speech
* [[SPEECHDAT Corpus|SPEECHDAT]] - Multi-purpose telephone speech database
* [[BDFALA Corpus|BDFALA]] - Speech analysis / synthesis
* [[EUROM.1 Corpus|EUROM.1]] - Multi-Lingual speech corpus for phonetic comparison

=== Translation ===

* [[Word_Alignments|Golden collection of parallel multi-language word alignments]] - Manually annotated word alignments for six European language pairs taken from the Europarl common test set

== Lexica ==
Pronunciation lexica (besides the ones included in the above corpora documentation):
* '''ONOMASTICA''' (Proper names in 11 European languages, in cooperation with TLP - Telefones de Lisboa e Porto): ~100,000 names of people, streets, towns and companies
* '''PF''' (Português Fundamental): ~26,000 citation forms

The pronunciation lexica developed by L²F use the SAMPA phonetic alphabet. See the [[SAMPA Table for European Portuguese|SAMPA table for European Portuguese]] and some comments about its design.

== See Also ==

* [[Resource Links]]

=== Newspapers ===
* [http://www.ims.uni-stuttgart.de/info/Newspapers.html List of Newspapers on the Internet] produced by [[Isabel Trancoso]] and maintained jointly with IMS Stuttgart.

=== Language Resource Centers ===
* [http://www.linguateca.pt Linguateca] (Distributed language resource center for Portuguese)
* [http://www.icp.grenet.fr/ELRA/home.html ELRA] (European Language Resources Association)
* [http://morph.ldc.upenn.edu/ LDC] (Linguistic Data Consortium)

=== Dictionaries ===
* [http://crnvmc.cern.ch/FIND/DICTIONARY? English/Technical Dictionary]
* [gopher://uts.mcc.ac.uk/77/gopherservices/enquire.english American English Dictionary]
* [gopher://gopher.princeton.edu:5003/7 Webster's Dictionary]
* [gopher://info.mcc.ac.uk/77/miscellany/acronyms/.index/index Acronyms Dictionary]
* [http://www.fmi.uni-passau.de/htbin/lt/lte English-German Dictionary]
* [http://www.fmi.uni-passau.de/htbin/lt/ltd German-English Dictionary]
* [http://nova.sti.nasa.gov/nasa-thesaurus.html NASA Thesaurus]

[[category:Resources]]

Word Alignments

2008-06-25T18:19:18Z

Javg:

Manually annotated word alignments for six different language pairs.

* Portuguese - English
* Portuguese - French
* Portuguese - Spanish
* English - Spanish
* English - French
* French - Spanish

If you use the corpus, please cite the following paper:

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, Diamantino António Caseiro, [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4735.pdf Building a golden collection of parallel Multi-Language Word Alignment], In The 6th International Conference on Language Resources and Evaluation, LREC 2008, May 2008

== Contents ==

The corpus is taken from the publicly available [http://www.statmt.org/europarl/ Europarl Corpus], which contains the proceedings of the European Parliament in the different official languages.
The golden collection is built over the first 100 sentences of the common test set, taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the [http://www.statmt.org/europarl/archives.html Europarl archives]. It is already tokenized and lowercased.

== Guidelines ==

The guidelines followed to produce the manual word alignments over the six language pairs (all combinations of Portuguese, English, French and Spanish) are available as a [http://www.inesc-id.pt/pt/indicadores/Ficheiros/4734.pdf PDF].

== Download ==

'''[http://www.l2f.inesc-id.pt/resources/translation/golden_collection.zip Golden collection of parallel multi-language word alignments]'''