Sixth International Conference on Spoken Language Processing
We use a recurrent net architecture adapted from that introduced by Robinson et al. We introduce a fully-connected hidden layer between the input and state nodes and the output. We show that this hidden layer makes the learning of complex classification tasks more efficient. Training uses backpropagation through time. There is one output unit per speaker, with the training targets corresponding to speaker identity.
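The architecture described above can be sketched as a forward pass: the hidden layer is fully connected to the concatenated input and state nodes, and both the speaker outputs and the next state are read from it. This is a minimal illustrative sketch, not the authors' implementation; all layer sizes, weight names, and the sigmoid nonlinearity are assumptions.

```python
# Minimal forward-pass sketch of a Robinson-style recurrent net with an
# added fully-connected hidden layer. All dimensions and weight names
# are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

n_input, n_state, n_hidden, n_speakers = 26, 32, 64, 12  # assumed sizes

# Hidden layer sees input nodes and state nodes together; output units
# (one per speaker) and the next state are computed from the hidden layer.
W_hidden = rng.standard_normal((n_hidden, n_input + n_state)) * 0.1
W_output = rng.standard_normal((n_speakers, n_hidden)) * 0.1
W_state = rng.standard_normal((n_state, n_hidden)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(frames):
    """Run a sequence of acoustic vectors through the net; return
    per-frame speaker activations (one output unit per speaker)."""
    state = np.zeros(n_state)
    outputs = []
    for x in frames:
        h = sigmoid(W_hidden @ np.concatenate([x, state]))
        outputs.append(sigmoid(W_output @ h))
        state = sigmoid(W_state @ h)  # state fed back at the next frame
    return np.array(outputs)

frames = rng.standard_normal((5, n_input))  # 5 dummy acoustic frames
y = forward(frames)
print(y.shape)  # one activation per speaker per frame
```

Training such a net with backpropagation through time would unroll this loop over the utterance and propagate the speaker-identity targets back through every frame.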
For 12 speakers (a mixture of male and female) we obtain a true acceptance rate of 100% with a false acceptance rate of 4%. For 16 speakers these figures are 94% and 7% respectively. We also investigate the sensitivity of identification accuracy to environmental factors (signal level, change of microphone and band limitation), the choice of acoustic vectors (FFT, LPC or cepstral), the distribution of speakers in the training database, and the inclusion of fundamental frequency. FFT features plus fundamental frequency give the best results.
This performance is shown to compare favorably with results reported for similar tasks using Hidden Markov Model techniques.
Bibliographic reference. Parveen, Shahla / Qadeer, Abdul / Green, Phil (2000): "Speaker recognition with recurrent neural networks", in ICSLP-2000, vol. 2, 306-309.