14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

On the Computation of Document Frequency Statistics from Spoken Corpora Using Factor Automata

Doğan Can, Shrikanth Narayanan

University of Southern California, USA

A factor automaton is an efficient data structure for representing all factors (substrings) of a set of strings (given, e.g., as a finite-state automaton). This notion can be generalized to weighted automata by associating a weight with each factor. In this paper, we consider the problem of computing expected document frequency (DF) and TF-IDF statistics for all substrings seen in a collection of word lattices by means of factor automata. We present an algorithm which transforms an acyclic weighted automaton, e.g. an ASR lattice, into a weighted factor automaton where the path weight of each factor represents the total weight associated by the input automaton to the set of strings including that factor at least once. We show how this automaton can be used to efficiently construct other types of weighted factor automata representing DF and TF-IDF statistics for all factors seen in a large speech corpus. Compared to the state of the art in computing these statistics from spoken documents, our approach i) generalizes the statistics from single words to contiguous substrings, ii) provides significant gains in terms of average run-time and storage requirements, and iii) constructs efficient inverted index structures for retrieval of such statistics. Experiments on a Turkish data set corroborate our claims.
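To make the quantity in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the factor-automaton algorithm of the paper: it assumes each spoken document is given as an explicit list of hypotheses with posterior probabilities (a stand-in for the paths of an ASR lattice), and it enumerates factors directly. The function names and the toy corpus are hypothetical and serve only as illustration. For each factor, the per-document value is the total probability of the hypotheses that contain the factor at least once, and the expected DF is the sum of these values over documents.

```python
from collections import defaultdict


def factors(words):
    """Return the set of all non-empty contiguous substrings of a word sequence."""
    n = len(words)
    return {tuple(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)}


def occurrence_probability(hypotheses):
    """Map each factor to the total probability of hypotheses containing it.

    `hypotheses` is a list of (word_tuple, probability) pairs whose
    probabilities are assumed to sum to 1 (an illustrative stand-in for
    the weighted paths of a lattice).
    """
    occ = defaultdict(float)
    for words, prob in hypotheses:
        # factors() returns a set, so a factor repeated inside one
        # hypothesis still contributes that hypothesis' weight only once.
        for factor in factors(words):
            occ[factor] += prob
    return dict(occ)


def expected_document_frequency(corpus):
    """Sum per-document occurrence probabilities over all documents."""
    df = defaultdict(float)
    for hypotheses in corpus:
        for factor, prob in occurrence_probability(hypotheses).items():
            df[factor] += prob
    return dict(df)


# Hypothetical toy corpus: two documents, two hypotheses each.
corpus = [
    [(("a", "b", "a"), 0.6), (("a", "b"), 0.4)],
    [(("b", "a"), 0.5), (("b", "b"), 0.5)],
]
expected_df = expected_document_frequency(corpus)
```

Note that the factor `("a",)` occurs twice in the first hypothesis of the first document but is counted once there, which is what distinguishes document frequency from term frequency. Explicit enumeration like this grows quickly with the number and length of hypotheses, which is the cost the automaton-based construction described in the abstract is meant to avoid.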


Bibliographic reference.  Can, Doğan / Narayanan, Shrikanth (2013): "On the computation of document frequency statistics from spoken corpora using factor automata", In INTERSPEECH-2013, 6-10.