Speech Prosody 2004
This paper presents a novel approach to the automatic detection of pitch accent in spoken English. The proposed approach is based on a time-delay recursive neural network (TDRNN), which takes contextual information into account in two ways: (1) delayed versions of the prosodic and spectral features serve as inputs, representing an explicit trajectory over time; and (2) recursions from the output layer and some hidden layers provide contextual labeling information that reflects the characteristics of pitch accentuation in spoken English. We apply the TDRNN to pitch accent detection in two forms. In the normal TDRNN, all of the prosodic and spectral features are used together as one input set to a single TDRNN. In the distributed TDRNN, the network consists of several TDRNNs, each of which takes a single prosodic feature as its input. In addition, we propose a feature called the spectral balance-based cepstral coefficient (SBCC) to capture the spectral characteristics of pitch accentuation. We used the Boston Radio News Corpus (BRNC) to conduct experiments on speaker-independent detection of pitch accent. The experimental results showed an average agreement of 83.6% between the automatic pitch accent labels and the hand labels.
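The two sources of context described above (a tapped delay line over the input features, and feedback of earlier network outputs) can be illustrated with a small sketch. This is a hypothetical, untrained forward pass in plain Python, not the authors' implementation. The class name `TinyTDRNN`, the layer sizes, the random weights, zero-padding before the first time step, and the choice to feed back only the previous output are all illustrative assumptions. The abstract also mentions recursion from some hidden layers, which is omitted here, and it does not specify whether a time step is a frame or a syllable. The `agreement` helper is likewise hypothetical and shows one way a label-agreement percentage could be computed.

```python
import math
import random
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyTDRNN:
    """Toy time-delay recursive network (hypothetical sizes and random weights, no training)."""

    def __init__(self, n_features: int, n_delays: int, n_hidden: int, seed: int = 0):
        rng = random.Random(seed)
        self.n_features = n_features
        self.n_delays = n_delays
        # Inputs: current + delayed feature vectors, plus the previous output (recursion).
        n_in = n_features * (n_delays + 1) + 1
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, steps: Sequence[Sequence[float]]) -> List[float]:
        """Return a pitch-accent probability for each time step."""
        zero = [0.0] * self.n_features  # assumed padding before the sequence starts
        outputs: List[float] = []
        prev_out = 0.0
        for t in range(len(steps)):
            # Tapped delay line: features at t, t-1, ..., t-n_delays.
            window: List[float] = []
            for d in range(self.n_delays + 1):
                idx = t - d
                window.extend(steps[idx] if idx >= 0 else zero)
            x = window + [prev_out]  # feed back the previous output as context
            hidden = [
                math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w_hidden, self.b_hidden)
            ]
            y = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
            outputs.append(y)
            prev_out = y
        return outputs


def agreement(auto: Sequence[int], hand: Sequence[int]) -> float:
    """Fraction of units whose automatic label equals the hand label."""
    if len(auto) != len(hand) or not auto:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(a == h for a, h in zip(auto, hand)) / len(auto)


if __name__ == "__main__":
    net = TinyTDRNN(n_features=3, n_delays=2, n_hidden=4)
    steps = [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2], [0.4, 0.1, 0.0], [0.2, 0.2, 0.2]]
    probs = net.forward(steps)
    labels = [1 if p >= 0.5 else 0 for p in probs]
    print(probs, agreement(labels, [0, 1, 1, 0]))
```

Because the previous output is part of each input vector, a change at an early step can influence outputs beyond the reach of the delay window. In a "distributed" variant as described in the abstract, one such network would be instantiated per prosodic feature (with `n_features=1`); how their outputs are combined is not stated in the abstract.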
Bibliographic reference. Ren, Yuexi / Kim, Sung-Suk / Hasegawa-Johnson, Mark / Cole, Jennifer (2004): "Speaker-independent automatic detection of pitch accent", In SP-2004, 521-524.