We describe a state-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) system trained on roughly 70 hours of conversational telephone speech. Using the Kaldi speech recognition toolkit, we investigate several aspects: for the acoustic front-end, we analyze the use of mel-frequency cepstral coefficients (MFCC), pitch and probability-of-voicing (PoV), and deep neural network (DNN) bottleneck (BN) features, as well as their feature-level combination ("tandem"). For the acousticphonetic decision tree, we explore different hidden Markov model (HMM) topologies for the glottalization phoneme /?/ to model its typically short duration. For the acoustic model, we compare regular continuous HMM with a sort of multi-codebook subspace Gaussian mixture model (SGMM) that lead to an overall best word error rate (WER) of 58.7% and 56.3%, respectively. The KWS is implemented as a word lattice search, and is augmented by a syllable lattice back-up search to capture out-of-vocabulary keywords as well as misrecognized lexical surface forms due to ambiguous prefix and hyphenation rules.
Bibliographic reference. Riedhammer, Korbinian / Do, Van Hai / Hieronymus, James (2013): "A study on LVCSR and keyword search for tagalog", In INTERSPEECH-2013, 2529-2533.