One of the major challenges for a language identification (LID) system is sparse training data. Manually collecting linguistic data in a controlled studio is usually expensive and impractical. Multilingual broadcast programs (Voice of America, for instance) offer a reasonable alternative source of linguistic data. However, unlike studio-collected data, broadcast programs usually contain much content other than pure linguistic data: music in the foreground or background, commercials, and noise from everyday life. In this study, a systematic processing approach is proposed to extract the linguistic data from broadcast media. Experimental results on NIST LRE 2009 data show that the proposed method provides a 22.2% relative improvement in segmentation accuracy and a 20.5% relative improvement in LID accuracy.
Cite as: Liu, G., Zhang, C., Hansen, J.H.L. (2012) A linguistic data acquisition front-end for language recognition evaluation. Proc. The Speaker and Language Recognition Workshop (Odyssey 2012), 224-228
@inproceedings{liu12_odyssey,
  author={Gang Liu and Chi Zhang and John H. L. Hansen},
  title={{A linguistic data acquisition front-end for language recognition evaluation}},
  year=2012,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2012)},
  pages={224--228}
}