Generating Gestural Scores from Acoustics Through a Sparse Anchor-Based Representation of Speech

Christopher Liberatore, Ricardo Gutierrez-Osuna


We present a procedure for generating gestural scores from speech acoustics. The procedure is based on our recent SABR (sparse, anchor-based representation) algorithm, which models the speech signal as a linear combination of acoustic anchors. We present modifications to SABR that encourage temporal smoothness by restricting the number of anchors that can be active over an analysis window. We propose that peaks in the SABR weights can be interpreted as “keyframes” that determine when vocal tract articulations occur. We validate the approach in two ways. First, we compare SABR keyframes to maxima in the velocity of electromagnetic articulography (EMA) pellets from an articulatory corpus. Second, we use keyframes and SABR weights to build a gestural score for the VocalTractLab (VTL) model, and compare synthetic EMA trajectories generated by VTL against those in the articulatory corpus. We find that SABR keyframes occur within 15–20 ms (on average) of EMA maxima, suggesting that SABR keyframes can be used to identify articulatory phenomena. However, comparison of synthetic and real EMA pellets shows moderate correlation on tongue pellets but weak correlation on lip pellets, a result that may be due to differences between the VTL speaker model and the source speaker in our corpus.
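The core ideas above — approximating each acoustic frame as a sparse linear combination of anchors, and reading keyframes off the peaks of the resulting weight trajectories — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the greedy OMP-style solver, the anchor limit `k`, and the simple local-maximum peak picker are all assumptions standing in for the details of SABR and its smoothness-constrained variant.

```python
import numpy as np

def sabr_weights(frame, anchors, k=3):
    """Sketch of a sparse anchor decomposition (OMP-style greedy fit).

    frame:   (d,) acoustic feature vector for one analysis frame
    anchors: (d, n) matrix whose columns are acoustic anchors
    k:       maximum number of active anchors (sparsity constraint)
    Returns an (n,) weight vector with at most k nonzero entries.
    """
    frame = np.asarray(frame, dtype=float)
    residual = frame.copy()
    w = np.zeros(anchors.shape[1])
    selected = []
    coef = np.array([])
    for _ in range(k):
        # Pick the anchor most correlated with the current residual.
        corr = anchors.T @ residual
        j = int(np.argmax(np.abs(corr)))
        if j in selected:
            break
        selected.append(j)
        # Refit the frame on all selected anchors (least squares).
        A = anchors[:, selected]
        coef, *_ = np.linalg.lstsq(A, frame, rcond=None)
        residual = frame - A @ coef
    if selected:
        w[selected] = coef
    return w

def keyframes(weight_traj):
    """Indices where one anchor's weight trajectory peaks over time;
    such peaks are interpreted as candidate articulatory keyframes."""
    w = np.asarray(weight_traj, dtype=float)
    return [t for t in range(1, len(w) - 1)
            if w[t] > w[t - 1] and w[t] >= w[t + 1]]
```

For example, with orthogonal anchors a frame built from two of them is recovered exactly by `sabr_weights`, and `keyframes` returns the frame indices where a weight trajectory rises then falls.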


DOI: 10.21437/Interspeech.2016-1336

Cite as

Liberatore, C., Gutierrez-Osuna, R. (2016) Generating Gestural Scores from Acoustics Through a Sparse Anchor-Based Representation of Speech. Proc. Interspeech 2016, 1507-1511.

Bibtex
@inproceedings{Liberatore+2016,
  author={Christopher Liberatore and Ricardo Gutierrez-Osuna},
  title={Generating Gestural Scores from Acoustics Through a Sparse Anchor-Based Representation of Speech},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1336},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1336},
  pages={1507--1511}
}