Sixth International Conference on Spoken Language Processing
Landmark-based speech processing is a component of Lexical Access From Features (LAFF), a novel paradigm for feature-based speech recognition. Detection and classification of landmarks is a crucial first step in a LAFF system. This work tests the theoretical characteristics of vowels, and shows results for work in progress on a Vowel Landmark Detector.
Acoustic theory predicts first-formant peaks in vowels, both in frequency and in amplitude (at least for vowels between orally closed consonants). Formant-tracking measurements found peaks in about 94% of vowels in the TIMIT database. Vowels that do not show a peak generally do not obey the theoretical assumptions, or are liable to formant tracker error due to nasalization, glottalization, or aspiration. Amplitude peaks are more reliable than frequency peaks. Peaks tend to occur early in the vowel, and frequency peaks tend to occur slightly before amplitude peaks. A fixed spectral band gave performance comparable to the formant tracker for this task, allowing a simpler detection algorithm.
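As an illustration of the kind of peak picking described above, the following is a minimal sketch, not the detector used in this work. It assumes a frame-level energy contour (in dB) has already been computed over some fixed spectral band; the function name pick_peaks, the prominence criterion, and the 3 dB threshold are all hypothetical choices made for the example.

```python
def pick_peaks(contour_db, min_prominence_db=3.0):
    """Return frame indices of local maxima in a band-energy contour.

    A candidate peak is kept only if it rises at least min_prominence_db
    above the higher of the two lowest points that separate it from a
    taller neighbour (or from the ends of the contour).
    Hypothetical helper: the threshold and criterion are assumptions.
    """
    peaks = []
    n = len(contour_db)
    for i in range(1, n - 1):
        value = contour_db[i]
        # Local maximum test; a plateau is reported once, at its first frame.
        if not (value > contour_db[i - 1] and value >= contour_db[i + 1]):
            continue
        # Walk left until a taller frame is met, tracking the lowest dip.
        left_min = value
        j = i - 1
        while j >= 0 and contour_db[j] <= value:
            left_min = min(left_min, contour_db[j])
            j -= 1
        # Walk right in the same way.
        right_min = value
        k = i + 1
        while k < n and contour_db[k] <= value:
            right_min = min(right_min, contour_db[k])
            k += 1
        prominence = value - max(left_min, right_min)
        if prominence >= min_prominence_db:
            peaks.append(i)
    return peaks


# Made-up contour with two clear peaks and one small ripple between them.
example_contour = [40, 45, 60, 62, 58, 50, 48, 49, 47, 55, 66, 64, 52, 41]
candidate_landmarks = pick_peaks(example_contour)
```

A prominence threshold is one simple way to suppress small ripples in the contour; lowering it admits more candidates at the cost of more insertions.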
Previous work on a Vowel Landmark Detector is extended by use of a multilayer perceptron (MLP) to combine knowledge-based acoustic cues. The MLP decreases the error rate to about 12%, of which about 8% are deletions. Since about 6% of vowels had no detectable peak, this performance is close to the expected limit of a peak-picking algorithm. Work is continuing on algorithm improvements, including the output of confidence scores.
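To make the idea of combining cues with an MLP concrete, here is a small sketch of a forward pass through a one-hidden-layer perceptron. It is an assumption-laden illustration only: the three input cues (peak height, prominence, duration, each scaled to roughly 0..1), the network size, and every weight value are hypothetical and are not taken from this work.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_score(cues, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP with a single output unit.

    cues     : list of acoustic cue values for one candidate peak
    w_hidden : one weight row per hidden unit
    b_hidden : one bias per hidden unit
    w_out    : one output weight per hidden unit
    b_out    : output bias
    Returns a score in (0, 1) for the candidate.
    """
    hidden = [
        sigmoid(sum(w * c for w, c in zip(row, cues)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical weights, chosen by hand for illustration (not trained).
W_HIDDEN = [[1.5, 1.0, 0.5], [-1.0, 0.5, 1.0]]
B_HIDDEN = [-1.0, 0.0]
W_OUT = [2.0, -1.0]
B_OUT = -0.5

# Hypothetical cue vectors: [peak height, prominence, duration].
strong_candidate = [1.0, 1.0, 0.5]
weak_candidate = [0.1, 0.1, 0.5]

strong_score = mlp_score(strong_candidate, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
weak_score = mlp_score(weak_candidate, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
```

In a sketch like this, thresholding the output gives an accept/reject decision for each candidate peak, and the raw output could in principle serve as a confidence score.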
Bibliographic reference. Howitt, Andrew Wilson (2000): "Vowel landmark detection", in ICSLP-2000, vol. 4, 628-631.