Time-alignment of several minutes of speech to the corresponding text can be divided into two sub-tasks: first, a broad alignment that detects anchor-points; second, a detailed alignment that uses these anchor-points. This paper describes a procedure for the broad alignment, which is based on segments of voiced/unvoiced speech. The speech signal is classified into segments of voiced/unvoiced events using a pitch-detection algorithm, and the corresponding segments of voiced/unvoiced events are generated from the text. A warp algorithm then matches the two segment sequences to produce the broad alignment. The proposed alignment procedure has been used on eleven data sets (spoken by four speakers, three male and one female) with a total error of 4.2% when an automatic pitch-detection algorithm was used to obtain the voiced/unvoiced events, and an error of 2.7% when manually edited voiced/unvoiced events were used.
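The abstract does not say which warp algorithm is used, so the following is only a minimal sketch of the matching step, assuming a classic dynamic time warping (DTW) over two sequences of voiced/unvoiced segments. The segment representation (a label plus a duration in seconds), the cost function, the `mismatch_penalty` value, and the names `local_cost` and `warp_segments` are all hypothetical choices made for illustration, not the authors' method.

```python
from typing import List, Tuple

# Hypothetical representation: ("V" or "U", duration in seconds).
Segment = Tuple[str, float]


def local_cost(a: Segment, b: Segment, mismatch_penalty: float = 1.0) -> float:
    """Cost of pairing two segments: label mismatch plus duration difference.

    This cost is an assumption for illustration only.
    """
    label_cost = 0.0 if a[0] == b[0] else mismatch_penalty
    return label_cost + abs(a[1] - b[1])


def warp_segments(
    speech: List[Segment], text: List[Segment]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two non-empty segment sequences with standard DTW.

    Returns the total cost and the warp path as (speech_index, text_index) pairs.
    """
    n, m = len(speech), len(text)
    inf = float("inf")
    # cost[i][j] = best cost of aligning speech[:i] with text[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local_cost(speech[i - 1], text[j - 1])
            cost[i][j] = c + min(
                cost[i - 1][j - 1],  # pair both segments and advance both
                cost[i - 1][j],      # several speech segments map to one text segment
                cost[i][j - 1],      # several text segments map to one speech segment
            )

    # Backtrack from the end to recover the warp path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Segments detected in the speech signal (e.g. by a pitch detector) ...
speech_segments = [("V", 0.30), ("U", 0.10), ("V", 0.50)]
# ... and segments predicted from the text.
text_segments = [("V", 0.30), ("U", 0.10), ("V", 0.50)]

total_cost, warp_path = warp_segments(speech_segments, text_segments)
print(total_cost, warp_path)
```

In a sketch like this, pairs on the warp path where the labels agree and the durations are close could serve as candidate anchor-points for the later detailed alignment; how anchor-points are actually selected is not stated in the abstract.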
Bibliographic reference. Andersson, Ake / Broman, Holger (1993): "Towards automatic speech-to-text alignment", In EUROSPEECH'93, 301-304.