Large Vocabulary Continuous Speech Recognition of Inflectional Language with Subword Units Stem-Ending

From HLT@INESC-ID

Date

  • November 05, 2004

Speaker

  • Tomaz Rotovnik

Abstract

In large vocabulary speech recognition systems, the recognition process itself is very time-consuming. Current state-of-the-art recognition systems can use vocabularies of 20K to 60K words. These systems have mostly been developed for English, which is a non-inflectional language. Slovenian, on the other hand, belongs to the group of inflectional languages, together with the other Slavic languages. Its rich morphology therefore presents a major problem for large vocabulary speech recognition. Compared to English, Slovenian requires a vocabulary ten times larger for the same degree of text corpus coverage. A limit on vocabulary size therefore causes a high rate of Out-of-Vocabulary (OOV) words, which in turn directly affects recognizer efficiency. Slovenian features many different word forms that can be derived from the same base form (lemma). In the thesis, we present a new algorithm for lemma-based decomposition of words into stems and endings. First, a set of words sharing the same lemma is defined. Each set is then decomposed with the new lemma-based algorithm. The length of the stems and the number of stems can be controlled with the cut_ratio parameter. When decomposing words into subword units, we also need to decompose their phonetic transcriptions, which is not always trivial, because the number of letters in a word does not always match the number of its phonemes. This problem was solved with an alignment algorithm based on the edit distance method. The characteristics of inflectional languages were taken into account in developing a new search algorithm, which uses a method that enforces the correct order of subword units and separate language models. The search algorithm combines the properties of subword-based models (reduced OOV rate) and word-based models (longer context). The algorithm also enables better search-space limitation for subword models.
For the presented search algorithms we also determine the upper bound of the search space, expressed as the number of active models. Using subword models, we increase recognizer accuracy while keeping the search space comparable to that of a standard word-based recognizer. The experimental results were evaluated on the SNABI speech database.
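
The abstract mentions that phonetic transcriptions are split by aligning letters with phonemes using edit distance. The following is only a minimal sketch of that general idea, not the algorithm from the thesis: it assumes a plain Levenshtein alignment with unit costs and a match only when the letter and phoneme symbols are identical. The function names `align` and `split_transcription`, the `stem_len` argument, and the toy symbols in the usage note are hypothetical.

```python
def align(letters, phonemes):
    """Align letters with phonemes via Levenshtein edit distance.

    Returns a list of (letter, phoneme) pairs; None marks a gap.
    Assumption: unit costs, and a symbol matches only if identical.
    """
    n, m = len(letters), len(phonemes)
    # dist[i][j] = edit distance between letters[:i] and phonemes[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if letters[i - 1] == phonemes[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match / substitution
                             dist[i - 1][j] + 1,        # letter without phoneme
                             dist[i][j - 1] + 1)        # phoneme without letter

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if letters[i - 1] == phonemes[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                pairs.append((letters[i - 1], phonemes[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((letters[i - 1], None))
            i -= 1
        else:
            pairs.append((None, phonemes[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def split_transcription(letters, phonemes, stem_len):
    """Split phonemes into (stem, ending) parts.

    Phonemes aligned with the first `stem_len` letters go to the stem,
    the rest go to the ending (hypothetical boundary rule).
    """
    stem_ph, ending_ph = [], []
    seen = 0  # number of letters consumed so far
    for letter, phoneme in align(letters, phonemes):
        target = stem_ph if seen < stem_len else ending_ph
        if phoneme is not None:
            target.append(phoneme)
        if letter is not None:
            seen += 1
    return stem_ph, ending_ph
```

For example, with the toy word `"abxcd"` whose letter `x` has no phoneme, `split_transcription("abxcd", ["a", "b", "c", "d"], 3)` places the phonemes of the first three letters in the stem and the remaining ones in the ending. A real system would need a cost function that knows which letters typically map to which phonemes; the identity match used here is a simplification.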