In STRING, the Natural Language Processing (NLP) chain of L2F, tokenization and morphological analysis were performed by two independent modules. Tokenization was performed with regular expressions, and morphological analysis used a transducer to assign morphosyntactic (part-of-speech) tags to the tokens. This work merged the tokenization and morphological analysis modules into a single transducer-based module, LexMan. With this change, it became possible to move context-independent morphosyntactic joining rules (for compound identification), previously implemented in RuDriCo, the chain's morphosyntactic disambiguator, into the LexMan module. The information used to generate the dictionary transducer can now also be complemented with derivational information, making it possible to recognise prefix-derived words, particularly neologisms. Two new architectures were created and evaluated against the initial architecture. Both proved more efficient when processing large texts. On the largest texts evaluated, the Prune-based architecture was 8.63% faster than the ShortestPath-based one and 69.60% faster than the initial architecture. The prefix module was integrated into this faster, Prune-based architecture, which made it possible to extend the coverage of the system's lexical resources. Integrating this new module increased the system's processing time by 15.56% on the same evaluation texts. That loss in performance was attenuated by removing the now-redundant prefixed words from the dictionary of lemmas.
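The idea of recognising prefix-derived words from derivational information can be illustrated with a minimal sketch. It is not LexMan's implementation: LexMan uses transducers, whereas this sketch swaps in plain set lookup and string stripping. The names `LEMMAS`, `PREFIXES` and `analyse`, and the tiny word lists, are hypothetical and serve only as illustration.

```python
from typing import Optional, Tuple

# Hypothetical toy resources; the real dictionary of lemmas is far larger.
LEMMAS = {"fazer", "social", "mercado"}
PREFIXES = ("anti", "super", "des", "re")


def analyse(word: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (prefix, base) for a known or prefix-derived word, else None.

    A word found directly in the dictionary is returned with prefix None.
    Otherwise each prefix is stripped in turn (longest first) and the
    remainder is looked up in the dictionary.
    """
    word = word.lower()
    if word in LEMMAS:
        return (None, word)
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        base = word[len(prefix):]
        if word.startswith(prefix) and base in LEMMAS:
            return (prefix, base)
    return None


if __name__ == "__main__":
    for w in ("refazer", "antisocial", "fazer", "xyz"):
        print(w, analyse(w))
```

Under this assumption, a prefixed form such as "refazer" no longer needs its own dictionary entry, which mirrors the removal of redundant prefixed words described above.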
This thesis addresses the problem of Verb Sense Disambiguation (VSD) in European Portuguese. VSD is a sub-problem of Word Sense Disambiguation (WSD), which tries to identify the sense in which a polysemous word is used in a given sentence. A sense inventory for each word (or lemma) is therefore required. For the VSD problem, this sense inventory consisted of a lexico-syntactic classification of the most frequent verbs in European Portuguese (ViPEr).
Two approaches to VSD were considered. The first, rule-based, approach makes use of the lexical, syntactic and semantic descriptions of the verb senses present in ViPEr to determine the meaning of a verb. The second approach uses machine learning with a set of features commonly used in the WSD problem to determine the correct meaning of the target verb.
Both approaches were tested in several scenarios to determine the impact of different features and different combinations of methods. The baseline accuracy of 84%, obtained by assigning each verb lemma its most frequent sense, was both surprisingly high and hard to surpass. Still, both approaches provided some improvement over this value. The best combination of the two techniques and the baseline yielded an accuracy of 87.2%, a gain of 3.2 percentage points over the baseline.
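The most-frequent-sense baseline can be sketched as follows. This is a minimal illustration, not the thesis's evaluation code: the function names `train_mfs` and `accuracy`, the sense labels, and the toy data are all hypothetical and do not come from ViPEr or the evaluation corpus.

```python
from collections import Counter, defaultdict


def train_mfs(examples):
    """Map each verb lemma to the sense seen most often in training."""
    counts = defaultdict(Counter)
    for lemma, sense in examples:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def accuracy(model, examples):
    """Fraction of instances whose gold sense equals the predicted sense."""
    correct = sum(1 for lemma, sense in examples if model.get(lemma) == sense)
    return correct / len(examples)


# Hypothetical (lemma, sense) pairs, for illustration only.
TRAIN = [
    ("abandonar", "sense_A"), ("abandonar", "sense_A"), ("abandonar", "sense_B"),
    ("contar", "sense_B"), ("contar", "sense_B"), ("contar", "sense_A"),
]
TEST = [
    ("abandonar", "sense_A"), ("contar", "sense_A"),
    ("contar", "sense_B"), ("abandonar", "sense_A"),
]

model = train_mfs(TRAIN)

if __name__ == "__main__":
    print(model)
    print(accuracy(model, TEST))
```

Because the baseline ignores the sentence context entirely, any gain over it has to come from instances where a verb is used in one of its less frequent senses.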