Structure-based pronunciation assessment

Nobuaki Minematsu, Masayuki Suzuki

We present two demonstration systems that are both built on the same new speech technology for pronunciation assessment. In the first demo, a learner's pronunciation of English vowels is assessed by comparing the learner's vowels with those of a teacher whom the learner selects according to his or her preference. The system then indicates which vowel the learner should correct first to sound more like the selected teacher. In the second demo, Chinese speakers are classified purely on the basis of their dialects. This system does not identify a speaker's dialect. Instead, it classifies the input speaker together with the speakers in the database solely by their dialectal accents, without being influenced by the speakers' age or gender.

Both demo systems are built on structure-based pronunciation assessment. The sound system (structure) underlying a speaker's pronunciation is estimated and compared with that of another speaker, and the differences between the two structures are then calculated quantitatively. Note that these differences exclude any differences caused by extra-linguistic factors, because pronunciation structures are extracted from utterances after extra-linguistic features have been removed from the speech acoustics. For example, the pronunciation of a child can be compared directly with that of a very tall male speaker even though their voice qualities are completely different. In the same way, speakers can be classified by dialect regardless of whether they are children or adults. After a participant records some utterances, the result of the pronunciation assessment is printed and handed to the participant within a minute.
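The abstract does not specify how a structure is represented or how two structures are compared. The Python sketch below assumes one plausible choice: each vowel is modeled as a multivariate Gaussian over acoustic features, a speaker's structure is the matrix of Bhattacharyya distances between all vowel pairs, and two structures are compared by the root-mean-square difference of their off-diagonal entries. The Bhattacharyya distance does not change when every feature vector is mapped through the same invertible affine transform x → Ax + b, so the sketch uses such a transform as a stand-in for an extra-linguistic difference such as vocal-tract size. The functions `structure`, `structure_distance` and `vowel_to_correct_first`, and the rule used for picking the first vowel to correct, are hypothetical illustrations and not the demo systems' actual implementation.

```python
import numpy as np


def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    mahal = diff @ np.linalg.solve(cov, diff) / 8.0
    logdet = np.linalg.slogdet(cov)[1]
    logdet1 = np.linalg.slogdet(cov1)[1]
    logdet2 = np.linalg.slogdet(cov2)[1]
    return float(mahal + 0.5 * (logdet - 0.5 * (logdet1 + logdet2)))


def structure(gaussians):
    """Assumed structure: symmetric matrix of distances between all vowel pairs.

    `gaussians` is a list of (mean, covariance) tuples, one per vowel.
    """
    n = len(gaussians)
    s = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s[i, j] = s[j, i] = bhattacharyya(*gaussians[i], *gaussians[j])
    return s


def structure_distance(s, t):
    """Assumed comparison: RMS difference over the upper-triangular entries."""
    iu = np.triu_indices_from(s, k=1)
    return float(np.sqrt(np.mean((s[iu] - t[iu]) ** 2)))


def vowel_to_correct_first(learner, teacher):
    """Hypothetical rule: the vowel whose row deviates most from the teacher's."""
    deviation = np.abs(learner - teacher).sum(axis=1)
    return int(np.argmax(deviation))


rng = np.random.default_rng(0)
dim, n_vowels = 3, 5


def random_gaussian():
    """Toy vowel model: random mean and a positive-definite covariance."""
    m = rng.normal(size=(dim, dim))
    return rng.normal(size=dim) * 3.0, m @ m.T + 0.5 * np.eye(dim)


teacher = [random_gaussian() for _ in range(n_vowels)]

# Simulated extra-linguistic change, modeled as an invertible affine map.
A = rng.normal(size=(dim, dim)) + 3.0 * np.eye(dim)
b = rng.normal(size=dim)
child = [(A @ mu + b, A @ cov @ A.T) for mu, cov in teacher]
print(structure_distance(structure(teacher), structure(child)))  # close to 0

# A learner whose vowel 2 is shifted relative to the teacher's.
learner = list(teacher)
learner[2] = (teacher[2][0] + 2.0, teacher[2][1])
print(vowel_to_correct_first(structure(learner), structure(teacher)))  # 2
```

Shifting one vowel changes every distance in that vowel's row but only one entry in each other row, which is why the row-sum rule points at that vowel. A real system would need a principled way of ranking corrections; this rule only illustrates how a structural comparison can localize a difference.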

[Figure: Teacher selection window]

[Figure: Classification of Chinese adults and children based on dialects]
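Once each speaker has been reduced to a structure, the dialect-based classification in the second demo can be sketched in pure Python. The sketch rests on three assumptions, all hypothetical. Structures are compared by the root-mean-square difference of their pairwise vowel distances. The input speaker is classified by ranking database speakers by that distance. Speakers are grouped by single-linkage clustering cut at a distance threshold. The speaker names and three-vowel matrices are made-up toy values, constructed so that the adult and the child of each dialect have similar structures. They illustrate the mechanics only and are not results from the demo.

```python
import math


def sym(a, b, c):
    """Hypothetical helper: 3x3 symmetric structure from its upper entries."""
    return [[0.0, a, b], [a, 0.0, c], [b, c, 0.0]]


def structure_distance(s, t):
    """Assumed comparison: RMS difference over the upper-triangular entries."""
    n = len(s)
    diffs = [(s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(diffs) / len(diffs))


def rank_by_accent(query, database):
    """Database speaker names sorted from most to least similar structure."""
    return sorted(database, key=lambda name: structure_distance(query, database[name]))


def group_speakers(structures, threshold):
    """Single-linkage grouping: link speakers closer than `threshold`."""
    names = list(structures)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if structure_distance(structures[a], structures[b]) < threshold:
                parent[find(a)] = find(b)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Toy structures: adults and children of two hypothetical dialects A and B.
database = {
    "adult_A": sym(2.0, 3.0, 4.0),
    "child_A": sym(2.1, 3.0, 4.1),
    "adult_B": sym(3.0, 2.0, 5.0),
    "child_B": sym(3.1, 2.0, 4.9),
}
query = sym(2.05, 3.0, 4.0)

print(rank_by_accent(query, database))
# ['adult_A', 'child_A', 'adult_B', 'child_B']
print(group_speakers(database, threshold=0.5))
# [['adult_A', 'child_A'], ['adult_B', 'child_B']]
```

Because the structures are assumed to be free of extra-linguistic factors, adults and children sit in the same ranking and the same clusters. No separate normalization step for age or gender appears in the code.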

