In Portuguese, as in other languages, the pronunciation of a word can depend on its word class (also called part of speech, lexical tag, or morphosyntactic class). For example, the word "almoço" is pronounced with a closed "o" when used as a noun and with an open "o" when used as a verb. A similar contrast occurs in English with the word "object", which is stressed as "OBject" when used as a noun and as "obJECT" when used as a verb. Thus, knowing the part of speech can help a text-to-speech (TTS) system produce correct pronunciations.
The focus of this work is the development of a part-of-speech tagger for Portuguese.
In developing the tagger, we compared two approaches: a probabilistic approach and a hybrid approach. The first was aimed at integration within the Portuguese version of the Festival system. Festival is a modular, freely available TTS system developed at the University of Edinburgh. In this multilingual system, the morphological analysis component is entirely lexicon-based, and the part-of-speech tagging algorithm is a language-independent, n-gram-based trainable tool. This tool is based on Hidden Markov Models (HMMs) and uses the Viterbi algorithm to find the most likely sequence of tags.
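To make the HMM-based tagging step concrete, the following is a minimal sketch of Viterbi decoding over a bigram HMM. It is not Festival's implementation: the function name, the toy tag set, and all probabilities are invented for illustration, and the fixed probability floor is a crude stand-in for real smoothing.

```python
import math

FLOOR = 1e-12  # assumed tiny probability for unseen events (not real smoothing)


def log_p(p):
    """Log of a probability, with zero replaced by a small floor."""
    return math.log(p if p > 0 else FLOOR)


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a bigram HMM."""
    # best[i][t] = (log-probability of the best path ending in tag t at
    #               position i, previous tag on that path)
    best = [{
        t: (log_p(start_p.get(t, 0)) + log_p(emit_p[t].get(words[0], 0)), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximizes the path score.
            score, prev = max(
                (best[i - 1][p][0] + log_p(trans_p[p].get(t, 0)), p)
                for p in tags
            )
            column[t] = (score + log_p(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Backtrace from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy model (hypothetical numbers, chosen only to illustrate the ambiguity
# of "almoço" between noun and verb).
TAGS = ["DET", "PRON", "N", "V"]
START_P = {"DET": 0.5, "PRON": 0.5}
TRANS_P = {
    "DET": {"N": 0.9, "V": 0.1},
    "PRON": {"V": 0.8, "N": 0.2},
    "N": {"V": 0.5, "N": 0.5},
    "V": {"DET": 0.5, "N": 0.5},
}
EMIT_P = {
    "DET": {"o": 1.0},
    "PRON": {"eu": 1.0},
    "N": {"almoço": 0.5},
    "V": {"almoço": 0.5},
}

noun_reading = viterbi(["o", "almoço"], TAGS, START_P, TRANS_P, EMIT_P)
verb_reading = viterbi(["eu", "almoço"], TAGS, START_P, TRANS_P, EMIT_P)
```

With this toy model, the preceding word alone decides the reading: after the determiner "o" the noun tag wins, and after the pronoun "eu" the verb tag wins, which is the information a pronunciation module would need to choose between the closed and open vowel.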
The hybrid approach that we have developed comprises three modules: a morphological analysis module, a module of linguistically oriented disambiguation rules, and a probabilistic disambiguation module. The morphological analysis module adopted is Palavroso, a large-coverage analyser developed at INESC. The rule-based disambiguation module is still under development and is based on local grammars. The probabilistic disambiguation module is also based on HMMs; it uses the Viterbi algorithm to find the most likely sequence of tags for a given sequence of words, and the forward algorithm to compute the lexical probabilities.
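The forward algorithm differs from Viterbi in that it sums over all tag sequences instead of keeping only the best one. The text does not detail how it is applied to the lexical probabilities, so the sketch below shows only the generic forward recursion, which returns the total probability of a word sequence under an HMM. The function name and the toy model are assumptions made for illustration.

```python
def forward(words, tags, start_p, trans_p, emit_p):
    """Total probability of `words` under an HMM, summed over all tag paths."""
    # alpha[t] = probability of the words seen so far, ending in tag t
    alpha = {
        t: start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0)
        for t in tags
    }
    for w in words[1:]:
        alpha = {
            t: emit_p[t].get(w, 0.0)
            * sum(alpha[p] * trans_p[p].get(t, 0.0) for p in tags)
            for t in tags
        }
    return sum(alpha.values())


# Toy model (hypothetical numbers, for illustration only).
TAGS = ["DET", "N", "V"]
START_P = {"DET": 1.0}
TRANS_P = {"DET": {"N": 0.9, "V": 0.1}, "N": {}, "V": {}}
EMIT_P = {
    "DET": {"o": 1.0},
    "N": {"almoço": 0.5},
    "V": {"almoço": 0.5},
}

likelihood = forward(["o", "almoço"], TAGS, START_P, TRANS_P, EMIT_P)
```

Here the likelihood is the sum of the two possible paths, DET N (1.0 × 0.9 × 0.5) and DET V (1.0 × 0.1 × 0.5). A practical implementation would work in log space or rescale the alpha values at each step to avoid underflow on long sentences.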