ISCA Archive ISCSLP 2004
ISCA Archive ISCSLP 2004

Large Vocabulary Continuous Mandarin Speech Recognition using Finite State Machine

YiCheng Pan, ChiaHsing Yu, LinShan Lee

Finite state transducer (FST), popularly used in natural language processing (NLP) area to represent the grammar rules and the characteristics of a language, has been extensively used as the core in large vocabulary continuous speech recognition (LVCSR) in recent years. By means of FST, we can effectively compose the acoustic model, pronunciation lexicon, and language model to form a compact search space. In this paper, we present our approaches of developing a LVCSR decoder using FST as the core. In addition, the traditional one-pass tree-copy search algorithm is also described for comparison in terms of speed, memory requirements and achieved character accuracy. 1 INTRODUCTION Automata theory has been developed with a long history, and relevant research is still ongoing due to its elegant framework and high efficiency. It has been widely used in natural language processing (NLP) area to model the grammar rules and characteristics of the language. The application of finite state automata on large vocabulary continuous speech recognition (LVCSR) was first introduced by M.Mohri[1], and a new concept of weighted-finite-state machine was introduced, including approaches of transforming the popularly used models such as HMM models and N-gram language models to finite state machine[1][2]. By means of the weighting scheme introduced, we can effectively integrate several probability likelihood functions in a finite state machine in a unified approach. We can then incorporate different sources of knowledge easier and reduce the complexity of the search space. Finite-state-transducer (FST) is an extension of finite state automata. A finite state machine can accept specific input strings (the set of strings accepted by a finite-state-machine is referred to as a “language”) and a FST further has a string as output when it accepts strings. In the LVCSR system using FST as the core, we first compile three basic knowledge sources (HMM acoustic models, pronunciation lexicon, n-gram language model) into three respective FSTs. By means of the composition algorithm, we can further integrate the three FSTs into a vast search network. A Viterbi search is then performed on this network and a best-matched path along with the recognized word sequence will be returned. In the rest of this paper, we first review traditional one-pass tree-copy search algorithm in Section 2. We then introduce our decoder based on FST in Section 3. The experiments and comparison between the two different decoders in terms of speed, memory requirement and character accuracy are presented in Section 4. Finally, in Section 5 some discussions are given. 2 ONE-PASS TREE-COPY SEARCH In this section, we briefly review the traditional one-pass tree copy search algorithm. In this algorithm, the search is implemented in a left-toright, frame-synchronous fashion. We first compile a lexicon tree as shown in Fig.1, in which each arc represents a subsyllabic HMM model and each path from the root to a leaf represents one or several word(s) with the same pronunciation. The arcs visited by each path are the respective HMM models, of which each word is composed. In the process of searching, each active node may have several copies, where each copy represents a different language model history. The path from the root to an active node forms a partial path. If the active node is a leaf node, a new language model history is generated, all the nodes having the same history are recombined, and only the node with the highest score will survive while others pruned. Then from the new node having survived, we further generate a new tree copy or replace the respective existing tree root node, if the tree with specific history is already generated and the score of the new node having survived is higher than the existing one. It should be noted that we don’t need to actually make several tree-copies at run-time, and the tree is just the data structure taken as a reference. In practice, only the active nodes need to be kept in the memory, where each active node represents a partial path. It should be also noted that the number of the nodes activated at each frame may increase rapidly, thus the total number of active nodes may increase exponentially. With limited memory and computing power, we therefore need to further prune those active nodes with lowest scores. During pruning, we need to take the language model scores into account. At each node, there are one or several paths to go to different leaf node(s). We pick up the leaf among them which has the highest uni-gram language model score, and this score is the look-ahead language model score for the node. Each active node is then judged by the sum of their decoding score and the look-ahead language model score during pruning. 3 FST-BASED SEARCH ALGORITHM 3.1 FST and WFST An FST A can be represented as a six-tuple: > ∆ ∑ < δ, , , , , F n Q . Q is the set of all states in A, while n ∊ Q is the initial state and F⊆Q is the set of all final states.∑ is the alphabet set of input strings, while ∆ is the alphabet set of outt_u uo b_u f_u u u l_i g_ee ei iau uei d_u uei 對 腿 脫不了 託付給 j_u ueng 對中 uo sh_u u b_a a 巴 uan b_a ai 團拜 ji_i i 跡 多數 Fig.1 An example of a tree lexicon 0-7803-8678-7/04/$20.00 ©2004 IEEE 5 ISCSLP 2004

Cite as: Pan, Y., Yu, C., Lee, L. (2004) Large Vocabulary Continuous Mandarin Speech Recognition using Finite State Machine. Proc. International Symposium on Chinese Spoken Language Processing, 5-8

  author={YiCheng Pan and ChiaHsing Yu and LinShan Lee},
  title={{Large Vocabulary Continuous Mandarin Speech Recognition using Finite State Machine}},
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},