ASR has long attracted attention for call-center monitoring systems. In ASR for call-center conversations, the system usually divides the input signal into separate utterances and removes unneeded silence before running ASR on the detected utterances. The input signal should therefore be split into utterances of a length appropriate for both ASR performance and readability. However, typical voice activity detection (VAD) techniques sometimes generate overly long speech segments because they consider only the length of the pause (non-speech interval) between sentences. In contrast, speakers have been shown to take breaths when producing multiple sentences or long sentences, and these breaths are highly correlated with major prosodic breaks. In this paper, we focus on the breath events in pause intervals and attempt to split the input signal into utterances by detecting them. The proposed method leverages acoustic features specialized for breathing sounds in a two-step approach that detects breath events with an accuracy of 97.4%. Moreover, speech phrasing based on the detected breath events reduced the word error rate in ASR.
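For illustration only, the pause-length-based segmentation that the abstract contrasts with can be sketched as a simple short-time-energy VAD: frames above an energy threshold are speech, and a segment boundary is placed only when the pause exceeds a minimum duration. This is a hypothetical sketch, not the authors' method; the function name, threshold, and frame sizes are all assumptions.

```python
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10,
                      threshold=0.01, min_pause_frames=10):
    """Split a waveform into speech segments at long low-energy pauses.

    Returns a list of (start_sample, end_sample) tuples. Illustrative
    pause-based VAD only; all parameter values are assumptions.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    # Short-time energy per frame
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    voiced = energy > threshold

    segments = []
    start = None  # frame index where the current segment began
    last = None   # most recent voiced frame index
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            last = i
        elif start is not None and i - last >= min_pause_frames:
            # Pause is long enough: close the current segment
            segments.append((start * hop, last * hop + frame))
            start = None
    if start is not None:
        segments.append((start * hop, last * hop + frame))
    return segments

# Demo: 0.2 s of "speech", 0.3 s of silence, 0.2 s of "speech" at sr = 1000 Hz
sr = 1000
sig = np.concatenate([np.full(200, 0.5), np.zeros(300), np.full(200, 0.5)])
segs = segment_by_energy(sig, sr)
```

Because such a segmenter looks only at pause length, a speaker who pauses briefly between sentences would yield one overly long segment; the paper's point is that breath events in those pauses are a better cue for where to split.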
Bibliographic reference. Fukuda, Takashi / Ichikawa, Osamu / Nishimura, Masafumi (2011): "Breath-detection-based telephony speech phrasing", In INTERSPEECH-2011, 2625-2628.