5th International Conference on Spoken Language Processing
An automated speech signal labeling tool, developed for the QuickSig speech database environment, is described. It is based primarily on neural networks used as diphone event detectors. For robustness, only coarse categories of diphones, such as stop-vowel and vowel-nasal, are used; 64 such detectors are implemented to cover all Finnish diphones. Speech signals are preprocessed using warped linear prediction, and the diphone events from the neural network outputs are matched to the given text transcription using a simple rule-based parser. In isolated-word labeling of single-speaker signals, a well-trained system makes about 1-2% coarse labeling errors, and boundary positions deviate from careful manual labeling by about 10 ms on average. The ability to generalize to other speakers appears promising.
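As an illustration of the coarse diphone idea, the sketch below turns the transcription of an isolated word into the sequence of coarse diphone events that a parser could then try to match against detector outputs. It is not the authors' implementation. The class inventory in COARSE_CLASS, the silence padding, and the function name coarse_diphones are all assumptions made for this example; the abstract names only stop-vowel and vowel-nasal as categories. The sketch also assumes one letter per phoneme, which is only roughly true of Finnish orthography.

```python
# Hypothetical coarse phoneme classes. The abstract does not list the
# actual inventory; only "stop-vowel" and "vowel-nasal" are named there.
COARSE_CLASS = {
    **dict.fromkeys("aeiouyäö", "vowel"),
    **dict.fromkeys("ptkbdg", "stop"),
    **dict.fromkeys("mn", "nasal"),
    **dict.fromkeys("sfh", "fricative"),
    **dict.fromkeys("lrjv", "approximant"),
}
SILENCE = "silence"  # assumed padding class for isolated-word boundaries


def coarse_diphones(word):
    """Return the coarse diphone categories expected for an isolated word.

    Assumes one letter per phoneme (a simplification).
    """
    classes = [SILENCE] + [COARSE_CLASS[ch] for ch in word.lower()] + [SILENCE]
    # Each adjacent pair of classes is one coarse diphone event.
    return [f"{a}-{b}" for a, b in zip(classes, classes[1:])]


print(coarse_diphones("kana"))
```

For the word "kana", this yields five events: silence-stop, stop-vowel, vowel-nasal, nasal-vowel, and vowel-silence. Each event would correspond to one detector whose output peak marks a candidate boundary position.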
Bibliographic reference. Karjalainen, Matti / Altosaar, Toomas / Huttunen, Miikka (1998): "An efficient labeling tool for the Quicksig speech database", In ICSLP-1998, paper 0885.