Our goal is to process the soundtrack of a sports game (tennis) to understand the progress of the game and, ultimately, to infer its rules. The chair umpire's speech is one of the most useful sources of information, and we focus on identifying the locations of this signal in the soundtrack. Although current techniques for audio segmentation can work well on this task when the acoustics of the training and test data are well matched, they fail when there is a mismatch, which occurs when the chair umpire, the microphone placement, the environmental noise, etc. differ between the test and training data. Our technique uses high-level knowledge of the syntax of the audio events (derived from the training data) to make a coarse estimate of the locations of the umpire's speech. The data gathered from these locations is then iteratively refined by contrasting it with data believed to belong to another audio class (also gathered using the technique described above). A model is built from this refined data that enables the locations of the speech segments to be determined more accurately. Our approach is applied to three different tennis games, each with a different umpire and different commentators. The results show that it reaches almost the same performance level as supervised methods, in which models for the speech are built using prior knowledge of its locations.
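The iterative refinement loop described above can be illustrated with a toy sketch. Note this is not the paper's actual system: the synthetic one-dimensional features, the noisy initial labels standing in for the syntax-derived coarse estimate, and the nearest-mean classifier (in place of whatever acoustic models the paper trains) are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic frame features: "umpire speech" clustered near 2.0,
# the competing audio class near -2.0 (illustrative values only).
speech = rng.normal(2.0, 1.0, 300)
other = rng.normal(-2.0, 1.0, 700)
frames = np.concatenate([speech, other])
truth = np.concatenate([np.ones(300, bool), np.zeros(700, bool)])

# Coarse initial estimate (stand-in for the locations suggested by
# high-level knowledge of the event syntax): deliberately corrupted.
labels = frames > 0.5
labels ^= rng.random(frames.size) < 0.2  # flip 20% of the labels

# Iterative refinement: build a simple model (class mean) from each
# hypothesised label set, then relabel every frame by contrasting the
# speech hypothesis with the competing audio class.
for _ in range(10):
    mu_speech = frames[labels].mean()
    mu_other = frames[~labels].mean()
    labels = np.abs(frames - mu_speech) < np.abs(frames - mu_other)

accuracy = (labels == truth).mean()
```

Despite the corrupted starting labels, the loop converges because each retrained model separates the two classes a little better, which in turn cleans the labels for the next iteration; this bootstrapping is the core idea behind the unsupervised refinement the abstract describes.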
Bibliographic reference. Huang, Qiang / Cox, Stephen J. (2011): "Iterative improvement of speaker segmentation in a noisy environment using high-level knowledge", In INTERSPEECH-2011, 417-420.