In spoken term detection (STD) systems, approximate subword-level matching between a query term and automatically transcribed spoken documents is often employed for its reasonable accuracy and efficiency. However, a high out-of-vocabulary (OOV) rate often degrades subword-level recognition accuracy and thereby affects STD performance. This paper describes the use of new expanded acoustic representations of subword sequences for improved scoring between an OOV query term and a subword-unit transcription. Each subword is expanded into the states of its HMM, and each state is represented by a new acoustic structural feature, a distribution-distance vector (DDV). The proposed DDV representation and scoring are easily combined with two typical baseline STD approaches: DTW-based approximate matching with a subword-level acoustic dissimilarity measure, and lattice-based confidence scoring of subword n-grams. The experimental results showed that the proposed DDV-based scoring method significantly outperforms the simple DTW-scoring baseline with very little increase in the required search time. Combining DDV-based scoring with confidence-based scoring showed a complementary effect and attained the best STD performance compared with the results submitted to NTCIR-10 SpokenDoc2 (SDPWS) when only the NTCIR reference automatic transcript is used. A preliminary experiment with spoken query terms also showed a significant improvement for OOV queries.
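The state-level matching described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the DDV precisely, so the sketch assumes that the DDV of an HMM state is the vector of distances from that state's output distribution to those of all states, and that the local DTW cost is the Euclidean distance between two DDVs. The function names (`ddv_table`, `local_cost`, `subsequence_dtw`) and the toy distance matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def ddv_table(state_dist):
    """Assumed DDV table: row i holds the distances from state i to every state."""
    return np.asarray(state_dist, dtype=float)

def local_cost(ddv, q_states, d_states):
    """Euclidean distance between DDVs for every (query, document) state pair."""
    q = ddv[q_states]  # shape (Q, S)
    d = ddv[d_states]  # shape (D, S)
    return np.linalg.norm(q[:, None, :] - d[None, :, :], axis=2)

def subsequence_dtw(cost):
    """Best alignment cost of the whole query against any document span,
    normalized by the query length (an assumed normalization)."""
    n_q, n_d = cost.shape
    acc = np.full((n_q, n_d), np.inf)
    acc[0, :] = cost[0, :]  # the match may start at any document position
    for i in range(1, n_q):
        for j in range(n_d):
            best = acc[i - 1, j]
            if j > 0:
                best = min(best, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = cost[i, j] + best
    return acc[-1, :].min() / n_q  # the match may end at any document position

# Toy inter-state distance matrix (illustrative values only).
toy = [[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 0, 1],
       [3, 2, 1, 0]]
ddv = ddv_table(toy)
score = subsequence_dtw(local_cost(ddv, [1, 2], [0, 1, 2, 3]))
print(score)
```

Under these assumptions, a query state sequence that occurs exactly in the document state sequence gets a cost of zero, and the cost grows as the aligned states become acoustically less similar. A lower cost therefore indicates a more likely detection.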
Bibliographic reference. Makino, Mitsuaki / Yamamoto, Naoki / Kai, Atsuhiko (2014): "Utilizing state-level distance vector representation for improved spoken term detection by text and spoken queries", In INTERSPEECH-2014, 1732-1736.