A range of computational approaches has been used to model the discovery of word forms from continuous speech by infants. Typically, these algorithms are evaluated with respect to the ideal ‘gold standard’ word segmentation and lexicon. Such metrics assess how well an algorithm matches the adult state, but may not reflect the intermediate states of the child’s lexical development. We set up a new evaluation method based on the correlation between word frequency counts, derived by applying an algorithm to a corpus of child-directed speech, and the proportion of infants knowing those words, according to parental reports. We evaluate a representative set of four algorithms, applied to transcriptions of the Brent corpus, which have been phonologized using either phonemes or syllables as basic units. Results show remarkable variation in the extent to which these eight algorithm-unit combinations predicted infant vocabulary, with some of these predictions surpassing those derived from the adult gold standard segmentation. We argue that infant vocabulary prediction provides a useful complement to traditional evaluation; for example, the best predictor model was also one of the worst in terms of segmentation score, and there was no clear relationship between token or boundary F-score and vocabulary prediction.
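The evaluation idea described in the abstract can be illustrated with a minimal sketch: count the word forms that a segmentation algorithm outputs, then correlate those counts with the proportion of infants reported to know each word. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names, the toy data, and the choice of a plain Pearson correlation on raw counts (the abstract does not specify the correlation measure or any transformation of the counts).

```python
from collections import Counter
from math import sqrt


def word_counts(segmented_utterances):
    """Count how often each word form appears in a segmented corpus."""
    counts = Counter()
    for utterance in segmented_utterances:
        counts.update(utterance)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def vocabulary_prediction(segmented_utterances, proportion_known):
    """Correlate segmented word frequencies with reported word knowledge.

    proportion_known maps each word to the proportion of infants reported
    to know it (hypothetical format). Words the algorithm never produced
    are given a count of zero.
    """
    counts = word_counts(segmented_utterances)
    words = sorted(proportion_known)
    freqs = [counts.get(w, 0) for w in words]
    props = [proportion_known[w] for w in words]
    return pearson(freqs, props)


# Toy data, invented purely for illustration (not from the Brent corpus
# or from any parental report instrument).
toy_segmentation = [
    ["look", "at", "the", "doggy"],
    ["the", "doggy", "is", "big"],
    ["look", "the", "ball"],
]
toy_reports = {"doggy": 0.6, "ball": 0.4, "look": 0.5, "the": 0.1}

score = vocabulary_prediction(toy_segmentation, toy_reports)
print(f"vocabulary prediction (Pearson r): {score:.3f}")
```

Running the same function on the output of each algorithm-unit combination, and on the gold standard segmentation, would give one score per condition that can be compared alongside token and boundary F-scores. A log transform of the counts or a rank correlation would be a natural variation on this sketch.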
Cite as: Larsen, E., Cristia, A., Dupoux, E. (2017) Relating Unsupervised Word Segmentation to Reported Vocabulary Acquisition. Proc. Interspeech 2017, 2198-2202, doi: 10.21437/Interspeech.2017-937
@inproceedings{larsen17_interspeech,
  author={Elin Larsen and Alejandrina Cristia and Emmanuel Dupoux},
  title={{Relating Unsupervised Word Segmentation to Reported Vocabulary Acquisition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2198--2202},
  doi={10.21437/Interspeech.2017-937}
}