INTERSPEECH 2013

Factor automaton is an efficient data structure for representing all factors (substrings) of a set of strings (e.g. a finitestate automaton). This notion can be generalized to weighted automata by associating a weight to each factor. In this paper, we consider the problem of computing expected document frequency (DF), and TFIDF statistics for all substrings seen in a collection of word lattices by means of factor automata. We present an algorithm which transforms an acyclic weighted automaton, e.g. an ASR lattice, to a weighted factor automaton where the path weight of each factor represents the total weight associated by the input automaton to the set of strings including that factor at least once. We show how this automaton can be used to efficiently construct other types of weighted factor automata representing DF and TFIDF statistics for all factors seen in a large speech corpus. Compared to the stateoftheart in computing these statistics from spoken documents, our approach i) generalizes the statistics from single words to contiguous substrings, ii) provides significant gains in terms of average runtime and storage requirements and iii) constructs efficient inverted index structures for retrieval of such statistics. Experiments on a Turkish data set corroborate our claims. Acceleration of Spoken Term Detection Using a
Bibliographic reference. Can, Doğan / Narayanan, Shrikanth (2013): "On the computation of document frequency statistics from spoken corpora using factor automata", In INTERSPEECH2013, 610.