ISCA Archive Interspeech 2013
ISCA Archive Interspeech 2013

On the computation of document frequency statistics from spoken corpora using factor automata

Doğan Can, Shrikanth Narayanan

Factor automaton is an efficient data structure for representing all factors (substrings) of a set of strings (e.g. a finite-state automaton). This notion can be generalized to weighted automata by associating a weight to each factor. In this paper, we consider the problem of computing expected document frequency (DF), and TF-IDF statistics for all substrings seen in a collection of word lattices by means of factor automata. We present an algorithm which transforms an acyclic weighted automaton, e.g. an ASR lattice, to a weighted factor automaton where the path weight of each factor represents the total weight associated by the input automaton to the set of strings including that factor at least once. We show how this automaton can be used to efficiently construct other types of weighted factor automata representing DF and TF-IDF statistics for all factors seen in a large speech corpus. Compared to the state-of-the-art in computing these statistics from spoken documents, our approach i) generalizes the statistics from single words to contiguous substrings, ii) provides significant gains in terms of average run-time and storage requirements and iii) constructs efficient inverted index structures for retrieval of such statistics. Experiments on a Turkish data set corroborate our claims. Acceleration of Spoken Term Detection Using a


doi: 10.21437/Interspeech.2013-2

Cite as: Can, D., Narayanan, S. (2013) On the computation of document frequency statistics from spoken corpora using factor automata. Proc. Interspeech 2013, 6-10, doi: 10.21437/Interspeech.2013-2

@inproceedings{can13_interspeech,
  author={Doğan Can and Shrikanth Narayanan},
  title={{On the computation of document frequency statistics from spoken corpora using factor automata}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={6--10},
  doi={10.21437/Interspeech.2013-2}
}