CNN-Based Phone Segmentation Experiments in a Less-Represented Language

Céline Manenti, Thomas Pellegrini, Julien Pinquier

In recent years, there has been renewed interest in unsupervised discovery of sub-lexical and lexical units. Segmenting speech into phone-like units is a useful first step for this task. In this article, we report speech segmentation experiments in Xitsonga, a less-represented language spoken in South Africa. We use convolutional neural networks (CNN) with static FBANK coefficients as input. At each sliding frame of the signal, the models make a binary decision on whether a boundary is present. We compare a model trained exclusively on Xitsonga data with a bootstrap model trained on a larger corpus in another language, the BUCKEYE U.S. English corpus. A model with two convolution layers obtained a 79% F-measure on BUCKEYE with a 20 ms error tolerance, which equals the human inter-annotator agreement rate. We then used this bootstrap model to segment Xitsonga data and compared the results obtained when adapting it with 1 to 20 minutes of Xitsonga data.
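To make the evaluation metric concrete, the following minimal sketch shows one common way to score predicted boundaries against reference boundaries with a 20 ms tolerance. The abstract does not describe the exact scoring protocol, so the one-to-one greedy matching, the function name `boundary_f_measure`, and the example boundary times are all assumptions made for illustration, not the authors' implementation.

```python
def boundary_f_measure(reference, hypothesis, tolerance=0.020):
    """Score hypothesized boundaries against reference boundaries.

    Times are in seconds. A hypothesized boundary counts as a hit when it
    lies within +/- tolerance of a reference boundary; each reference
    boundary can be matched at most once (assumed protocol).
    Returns (precision, recall, f_measure).
    """
    ref = sorted(reference)
    hyp = sorted(hypothesis)
    i = j = hits = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(ref) and j < len(hyp):
        diff = hyp[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            # Hypothesis is too early for this and all later references.
            j += 1
        else:
            # Reference is too early for this and all later hypotheses.
            i += 1
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical boundary times (seconds), chosen only for illustration.
reference_times = [0.10, 0.25, 0.40, 0.60]
predicted_times = [0.11, 0.30, 0.41, 0.59, 0.80]
precision, recall, f_measure = boundary_f_measure(reference_times, predicted_times)
```

In this made-up example, three of the five predicted boundaries fall within 20 ms of a reference boundary, so precision is 3/5 and recall is 3/4. Other protocols, for instance ones that allow several predictions to match the same reference boundary, would give different scores.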

DOI: 10.21437/Interspeech.2016-796

Cite as

Manenti, C., Pellegrini, T., Pinquier, J. (2016) CNN-Based Phone Segmentation Experiments in a Less-Represented Language. Proc. Interspeech 2016, 3549-3553.

author={Céline Manenti and Thomas Pellegrini and Julien Pinquier},
title={CNN-Based Phone Segmentation Experiments in a Less-Represented Language},
booktitle={Interspeech 2016},