Elicitation of information structure from speech is a crucial step in automatic speech understanding. In terms of both production and perception, we consider the intonational phrase to be the basic meaningful unit of information structure in speech. This paper presents a method for detecting these units by processing both the recorded speech and its textual representation. Using syntactic information, we split the text into small groups of closely connected words. Assuming that intonational phrases are built from these small groups, we then use acoustic information to reveal the actual phrase boundaries. The procedure was initially developed for Russian speech, where it achieves the best published results for this language, with an F1 score of 0.91. We assume that it may be adapted to other languages that have some amount of read speech resources, including under-resourced languages. For comparison, we evaluated it on English material (the Boston University Radio Speech Corpus). Our result, an F1 score of 0.76, is comparable with the top systems designed for English.
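The two-stage idea in the abstract (syntactic word groups first, acoustic evidence second) and the F1 evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: the `Group` class, the `pause_after` feature, and the 0.15 s threshold are hypothetical stand-ins for whatever syntactic grouping and acoustic cues the actual system uses. The sketch only shows how candidate boundaries between word groups might be kept or discarded, and how predicted boundaries could be scored against a reference.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Group:
    """A small group of closely connected words (hypothetical structure)."""
    words: List[str]
    pause_after: float  # silence after the group, in seconds (assumed acoustic cue)


def merge_groups(groups: List[Group], min_pause: float = 0.15) -> List[List[str]]:
    """Join adjacent groups into phrases.

    A boundary is kept only where the pause after a group reaches
    `min_pause` (an illustrative threshold, not taken from the paper).
    """
    phrases, current = [], []
    for group in groups:
        current.extend(group.words)
        if group.pause_after >= min_pause:
            phrases.append(current)
            current = []
    if current:  # trailing words with no final boundary detected
        phrases.append(current)
    return phrases


def boundary_f1(predicted: Set[int], reference: Set[int]) -> float:
    """F1 score over boundary positions (word indices after which a phrase ends)."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, three groups with pauses of 0.02 s, 0.30 s, and 0.0 s would be merged into two phrases, since only the second pause reaches the assumed threshold. A real system would likely combine several acoustic features rather than a single pause duration.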
Cite as: Kocharov, D., Kachkovskaia, T., Skrelin, P. (2017) Eliciting Meaningful Units from Speech. Proc. Interspeech 2017, 2128-2132, doi: 10.21437/Interspeech.2017-855
@inproceedings{kocharov17_interspeech,
  author={Daniil Kocharov and Tatiana Kachkovskaia and Pavel Skrelin},
  title={{Eliciting Meaningful Units from Speech}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2128--2132},
  doi={10.21437/Interspeech.2017-855}
}