EUROSPEECH 2003 - INTERSPEECH 2003
Current automatic speech recognition systems convert the speech signal into a sequence of discrete units, such as phonemes, and then apply statistical methods to those units to produce the linguistic message. A similar methodology has been applied to speaker and language recognition, except that the system outputs speaker or language information instead of the linguistic message. Following this methodology, we propose using temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set of discrete units that can characterize the speaker and/or the language. The proposed approach is evaluated on the NIST Extended Data Speaker Detection task and the NIST Language Identification task.
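The segmentation idea can be illustrated with a minimal sketch. The abstract does not specify the unit inventory or the slope estimation, so everything below is an assumption made for illustration: frames with zero fundamental frequency are treated as unvoiced, and each voiced frame gets one of four hypothetical symbols according to whether fundamental frequency and energy are rising ("+") or falling ("-"). Consecutive frames with the same symbol are merged into one segment. The function names `label_frames` and `segment` are illustrative and not taken from the paper.

```python
from itertools import groupby


def _rising(track, i, usable):
    """Return True if the track does not decrease around frame i.

    Uses the next frame when it is usable, otherwise the previous one.
    An isolated frame is treated as flat, which counts as rising here.
    """
    if i + 1 < len(track) and usable[i + 1]:
        return track[i + 1] >= track[i]
    if i > 0 and usable[i - 1]:
        return track[i] >= track[i - 1]
    return True


def label_frames(f0, energy):
    """Assign one symbol per frame (assumed inventory: U, ++, +-, -+, --)."""
    voiced = [value > 0 for value in f0]   # assumption: f0 == 0 marks unvoiced frames
    always = [True] * len(energy)          # energy is defined on every frame
    labels = []
    for i in range(len(f0)):
        if not voiced[i]:
            labels.append("U")
            continue
        f0_sign = "+" if _rising(f0, i, voiced) else "-"
        en_sign = "+" if _rising(energy, i, always) else "-"
        labels.append(f0_sign + en_sign)
    return labels


def segment(labels):
    """Merge runs of identical labels into (label, start_frame, end_frame) tuples."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))  # end is exclusive
        start += length
    return segments


f0 = [0, 100, 110, 120, 115, 105, 0]
energy = [1, 2, 3, 4, 3, 2, 1]
units = segment(label_frames(f0, energy))
```

The resulting sequence of symbols (here `U`, `++`, `--`, `U`) plays the role that phoneme sequences play in speech recognition: statistical models over the symbols could then be trained per speaker or per language.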
Bibliographic reference. Adami, Andre G. / Hermansky, Hynek (2003): "Segmentation of speech for speaker and language recognition", In EUROSPEECH-2003, 841-844.