Variation in vocal effort represents one of the most challenging problems in maintaining speech system performance for coding, speech and speaker recognition. Changes in vocal effort (or mode) result in a fundamental change in speech production which is not simply a change in volume. This is the first study to collectively consider the five speech modes: whispered, soft, neutral, loud and shouted. After corpus development, analysis is performed for i) sound intensity level, ii) duration and silence percentage, iii) frame energy distribution and iv) spectral tilt. The analysis shows vocal effort dependent traits which are used to investigate speaker recognition. Matched vocal mode conditions result in a closed-set speaker ID rate of 97.62%, with mismatch vocal conditions producing 54.02%. Finally, a speech mode classification system is developed, which has a range of classification rate from 44.5% to 98.5% confusing with adjacent vocal modes. These advancements can provide improved speech/speaker modeling information, as well as classified vocal mode knowledge to improve speech and language technology in real scenarios.
Bibliographic reference. Zhang, Chi / Hansen, John H. L. (2007): "Analysis and classification of speech mode: whispered through shouted", In INTERSPEECH-2007, 2289-2292.