Laughter and fillers like "uhm" and "ah" are social cues expressed in human speech. Detection and interpretation of such non-linguistic events can reveal important information about the speakers' intentions and emotional state. The INTERSPEECH 2013 Social Signals Sub-Challenge sets the task of localizing and classifying laughter and fillers in the "SSPNet Vocalization Corpus" (SVC) based on acoustics. In the paper at hand we investigate phonetic patterns extracted from raw speech transcriptions obtained with the CMU Sphinx toolkit for speech recognition. Even though Sphinx was used out of the box and no dedicated training on the target classes was applied, we were able to successfully predict laughter and filler frames in the development set with 87% accuracy (unweighted average Area Under the Curve (AUC)). By combining our features with a set of standard features provided by the challenge organizers, results increased to above 92%. When applying the combined set to the test corpus we achieved a top score of 87.7%, which is 4.4% above the challenge baseline.
Bibliographic reference. Wagner, Johannes / Lingenfelser, Florian / André, Elisabeth (2013): "Using phonetic patterns for detecting social cues in natural conversations", In INTERSPEECH-2013, 168-172.
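The evaluation metric mentioned in the abstract, unweighted average AUC, averages the per-class (one-vs-rest) Area Under the ROC Curve with equal weight per class, regardless of class frequency. The following sketch illustrates how such a score could be computed from frame-level predictions; the function names and the rank-based AUC formulation are illustrative, not taken from the paper or the challenge toolkit.

```python
# Illustrative sketch: unweighted average AUC over classes
# (e.g. "laughter", "filler"), each scored one-vs-rest.
# The rank-based formula is the Mann-Whitney U statistic;
# tied scores receive average ranks.

def roc_auc(scores, labels):
    """AUC for one binary class: labels are 0/1, scores are confidences."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def unweighted_average_auc(per_class_scores, per_class_labels):
    """Mean of one-vs-rest AUCs, each class weighted equally."""
    aucs = [roc_auc(s, l) for s, l in zip(per_class_scores, per_class_labels)]
    return sum(aucs) / len(aucs)
```

For example, perfectly separated scores for one class give an AUC of 1.0, and the unweighted average simply averages such values across the laughter and filler classes.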