INTERSPEECH 2011
12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Combining Frame and Segment Level Processing via Temporal Pooling for Phonetic Classification

Sumit Chopra, Patrick Haffner, Dimitrios Dimitriadis

AT&T Labs Research, USA

We propose a simple, yet novel, multi-layer model for the problem of phonetic classification. Our model combines a frame level transformation of the acoustic signal with a segment level phone classification. Our key contribution is the study of new temporal pooling strategies that interface these two levels, determining how frame scores are converted into segment scores. On the TIMIT benchmark, we match the best performance obtained using a single classifier. Diversity in pooling strategies is further used to generate candidate classifiers with complementary performance characteristics, which perform even better as an ensemble. Without the use of any phonetic knowledge, our ensemble model achieves a 16.96% phone classification error. While our data-driven approach is exhaustive, the combinatorial inflation is limited to the smaller segmental half of the system.
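To make the interface between the two levels concrete, here is a minimal sketch of temporal pooling, under stated assumptions. The abstract does not specify which pooling strategies the paper studies, so the `mean` and `max` strategies, the score-averaging ensemble, and the function names `pool_segment` and `classify_segment` are illustrative choices, not the authors' method. The frame scores are assumed to come from some frame level model that emits one score per phone class for each frame.

```python
from statistics import mean


def pool_segment(frame_scores, strategy="mean"):
    """Collapse per-frame class scores into one score per class for a segment.

    frame_scores: list of frames, each a list with one score per class.
    strategy: "mean" or "max" (illustrative choices, not taken from the paper).
    """
    if not frame_scores:
        raise ValueError("segment has no frames")
    # Transpose so that each entry holds all frame scores for one class.
    per_class = list(zip(*frame_scores))
    if strategy == "mean":
        return [mean(scores) for scores in per_class]
    if strategy == "max":
        return [max(scores) for scores in per_class]
    raise ValueError(f"unknown pooling strategy: {strategy}")


def classify_segment(frame_scores, strategies=("mean", "max")):
    """Average the pooled scores of several strategies and pick the best class.

    This stands in for an ensemble of segment classifiers; a real system
    would train a separate classifier on top of each pooled representation.
    """
    pooled = [pool_segment(frame_scores, s) for s in strategies]
    combined = [mean(scores) for scores in zip(*pooled)]
    return max(range(len(combined)), key=combined.__getitem__)


# Three frames, two hypothetical phone classes.
frames = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
mean_scores = pool_segment(frames, "mean")
max_scores = pool_segment(frames, "max")
predicted = classify_segment(frames)
```

Each pooling strategy produces a different fixed-size summary of a variable-length segment, which is why varying the strategy can yield segment classifiers that make different errors.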


Bibliographic reference. Chopra, Sumit / Haffner, Patrick / Dimitriadis, Dimitrios (2011): "Combining frame and segment level processing via temporal pooling for phonetic classification", In INTERSPEECH-2011, 233-236.