To achieve more human-like interaction with technical systems, these systems have to adapt to the user's individual preferences and current emotional state. In human-human interaction, the behaviour of the speaker is characterised by semantic and prosodic cues, given (among other indicators) as short feedback signals. These so-called filled pauses minimally convey certain dialogue functions such as attention, understanding, confirmation, or other attitudinal reactions, and they play a valuable role in the progress and coordination of interaction. The first step in enabling an automatic system to react to these signals is to detect them within the user's utterances. This is a complex task, as filled pauses are phonetically short, mostly consisting of only one vowel and one consonant. In this paper we present our methods for detecting filled pauses in naturalistic interaction, utilising the LAST MINUTE corpus. We used an SVM classifier and further improved the results by applying a Gaussian filter to incorporate temporal context information and by performing a morphological opening to filter out false alarms. We obtained a recall of 70%, a precision of 55%, and an AUC of 0.94.
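The two post-processing steps named above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the filter parameters, the decision threshold, or the exact data the filters operate on, so the sketch assumes a one-dimensional sequence of per-frame classifier scores in [0, 1], and all function names and default values (`sigma`, `radius`, `threshold`, `width`) are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, sigma=1.0, radius=2):
    """Gaussian smoothing so each frame score reflects its neighbours."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(scores)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Replicate the first/last score beyond the sequence borders.
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * scores[idx]
        out.append(acc)
    return out


def erode(mask, width):
    """A frame stays True only if every frame in its window is True."""
    half, n = width // 2, len(mask)
    return [all(mask[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def dilate(mask, width):
    """A frame becomes True if any frame in its window is True."""
    half, n = width // 2, len(mask)
    return [any(mask[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def opening(mask, width=3):
    """Morphological opening: erosion followed by dilation.

    With an odd width, runs of True shorter than `width` frames are
    removed (away from the borders), while longer runs keep their extent.
    """
    return dilate(erode(mask, width), width)


def detect(scores, threshold=0.5, sigma=1.0, radius=2, width=3):
    """Hypothetical pipeline: smooth scores, threshold, then open the mask."""
    smoothed = smooth(scores, sigma, radius)
    mask = [s > threshold for s in smoothed]
    return opening(mask, width)


if __name__ == "__main__":
    # Toy scores: one isolated spike (frame 1) and one sustained segment.
    scores = [0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
    print(detect(scores))
```

In this sketch the smoothing spreads evidence across neighbouring frames, and the opening discards detections that are too short to be plausible segments. Which of the two steps removes a given false alarm depends on the assumed parameters.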
Bibliographic reference. Prylipko, Dmytro / Egorow, Olga / Siegert, Ingo / Wendemuth, Andreas (2014): "Application of image processing methods to filled pauses detection from spontaneous speech", In INTERSPEECH-2014, 1816-1820.