INTERSPEECH 2014
To automatically build the language processing component of a speech synthesis system
for a new language from scratch, a purified text corpus is needed in which any words
and phrases from other languages are clearly identified or excluded. When using found
data, with no prior linguistic knowledge of the language or languages it contains,
identifying the pure data is a difficult problem.
We propose an unsupervised language identification approach based on
Latent Dirichlet Allocation (LDA) in which raw n-gram counts are taken as features,
without any smoothing, pruning or interpolation. The LDA topic model is reformulated
for the language identification task, and Collapsed Gibbs Sampling is used to train
an unsupervised language identification model. We show that such a model is highly
capable of identifying the primary language in a corpus and filtering out the other
languages present.
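The approach the abstract describes lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch of LDA-based language filtering with collapsed Gibbs sampling, treating each sentence as a document and raw character n-grams as the vocabulary. The helper `char_ngrams`, the hyperparameters (`n_langs`, `alpha`, `beta`, `n_iters`), and the filtering rule are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch only: LDA reinterpreted for language identification,
# where the latent "topic" plays the role of a language label. Raw character
# n-grams are used as-is (no smoothing, pruning or interpolation).
import random
from collections import Counter

def char_ngrams(text, n=3):
    """Raw character n-grams of a sentence (assumed feature choice)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def lda_language_filter(sentences, n_langs=2, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    rng = random.Random(seed)
    docs = [char_ngrams(s) for s in sentences]
    V = len({g for d in docs for g in d})       # vocabulary size

    # Count tables maintained by collapsed Gibbs sampling.
    n_dk = [[0] * n_langs for _ in docs]        # language counts per sentence
    n_kw = [Counter() for _ in range(n_langs)]  # n-gram counts per language
    n_k = [0] * n_langs                         # total tokens per language
    z = []                                      # current language assignments

    # Random initialisation of the latent language of every n-gram token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_langs)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Collapsed Gibbs update:
                # p(k) ∝ (n_dk + α) * (n_kw + β) / (n_k + β·V)
                weights = [(n_dk[d][j] + alpha) *
                           (n_kw[j][w] + beta) / (n_k[j] + beta * V)
                           for j in range(n_langs)]
                k = rng.choices(range(n_langs), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Assumed filtering rule: the component holding the most tokens overall
    # is taken as the primary language; keep sentences it dominates.
    primary = max(range(n_langs), key=lambda k: n_k[k])
    return [s for d, s in enumerate(sentences)
            if max(range(n_langs), key=lambda k: n_dk[d][k]) == primary]
```

Because the sampler is collapsed, the per-language n-gram distributions and per-sentence language mixtures are never represented explicitly; only the count tables are updated, which is what makes the raw, unsmoothed n-gram counts workable as features.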
Bibliographic reference. Zhang, Wei / Clark, Robert A. J. / Wang, Yongyuan (2014): "Unsupervised language filtering using the Latent Dirichlet Allocation", in INTERSPEECH-2014, 1268-1272.