15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Speech Recognition Based on Itakura-Saito Divergence and Dynamics/Sparseness Constraints from Mixed Sound of Speech and Music by Non-Negative Matrix Factorization

Naoaki Hashimoto, Shoichi Nakano, Kazumasa Yamamoto, Seiichi Nakagawa

Toyohashi University of Technology, Japan

We investigated a speech recognition method for mixed sound composed of speech and music, in which only the music is removed by non-negative matrix factorization (NMF). We used Itakura-Saito divergence instead of Kullback-Leibler divergence as the cost function, and applied dynamics and sparseness constraints to the weight matrix to improve speech recognition. For isolated word recognition using the matched condition model, this reduced the word error rate by 52.1% relative to the case without music removal (word accuracy rose, on average, from 69.3% to 85.3%, i.e. the error rate fell from 30.7% to 14.7%).
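
As an illustration of the kind of factorization the abstract refers to, the following is a minimal sketch of NMF under the Itakura-Saito divergence, not the authors' implementation. It assumes a strictly positive power spectrogram `V` (frequency bins by frames) and factorizes it as `V ≈ W H`, with `W` the basis matrix and `H` the weight matrix. The function names, the rank, the iteration count, and the square-root (majorization-minimization) form of the multiplicative update are all assumptions made for this example; the dynamics and sparseness constraints from the paper, and the separation of music from speech bases, are deliberately omitted.

```python
import numpy as np


def is_divergence(V, V_hat):
    """Itakura-Saito divergence between V and V_hat, summed over all entries."""
    ratio = V / V_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def is_nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Factorize a strictly positive matrix V as W @ H under IS divergence.

    Hypothetical helper for illustration only. Uses multiplicative updates
    with a square-root exponent, which keep W and H non-negative.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # basis (spectral patterns)
    H = rng.random((rank, n_frames)) + eps  # weights (activations over time)
    costs = []
    for _ in range(n_iter):
        # Update the weight matrix H with W fixed.
        V_hat = W @ H + eps
        H *= np.sqrt((W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat)))
        # Update the basis matrix W with H fixed.
        V_hat = W @ H + eps
        W *= np.sqrt(((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T))
        costs.append(is_divergence(V, W @ H + eps))
    return W, H, costs


# Toy usage on random positive data standing in for a power spectrogram.
demo_rng = np.random.default_rng(1)
V_demo = demo_rng.random((20, 30)) + 0.1
W_demo, H_demo, costs_demo = is_nmf(V_demo, rank=4, n_iter=50)
```

In a supervised separation setting, one would typically learn separate bases for speech and music and reconstruct only the speech part before recognition; how the paper does this in detail is not stated in the abstract.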


Bibliographic reference.  Hashimoto, Naoaki / Nakano, Shoichi / Yamamoto, Kazumasa / Nakagawa, Seiichi (2014): "Speech recognition based on Itakura-Saito divergence and dynamics/sparseness constraints from mixed sound of speech and music by non-negative matrix factorization", In INTERSPEECH-2014, 2749-2753.