ISCA Archive ICSLP 1994

Automating the design of compact linguistic corpora

Rob Kassel

In this paper we address two aspects of linguistic corpus construction. First, we examine the process of selecting the units to be covered in our design. Rather than enumerating a set of fixed-length units, we derive variable-length units based on a measure of cohesiveness. Next, we consider how to select material that efficiently covers these, or other, units. Our scoring procedure takes frequency distributions into account to improve the compactness of the result. The proposed techniques have been successfully applied to the design of a handwriting corpus at MIT and a speech corpus elsewhere.
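
The abstract does not give the scoring procedure itself, so the following is only an illustrative sketch of frequency-aware selection, not the paper's method. It assumes a greedy cover in which each candidate text is scored by the inverse frequency of the units it would newly cover, divided by its length. The function name `select_compact_cover`, the inverse-frequency weighting, and the toy data are all hypothetical.

```python
def select_compact_cover(candidates, unit_freq):
    """Greedily pick texts until every unit found in the candidates is covered.

    candidates: dict mapping a text to the set of units it contains.
    unit_freq:  dict mapping each unit to a positive frequency count.
    Both the weighting and the length normalisation are assumptions.
    """
    uncovered = set()
    for units in candidates.values():
        uncovered |= units

    chosen = []
    while uncovered:
        def score(text):
            # Rare units weigh more, so texts carrying them are picked early.
            new_units = candidates[text] & uncovered
            gain = sum(1.0 / unit_freq[u] for u in new_units)
            # Normalise by length to favour short, dense material.
            return gain / max(len(text), 1)

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        if not candidates[best] & uncovered:
            break  # defensive: nothing left that adds coverage
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Toy example: "ef" is rare, so the text containing it is chosen first.
toy_candidates = {
    "ab cd": {"ab", "cd"},
    "ab": {"ab"},
    "cd ef": {"cd", "ef"},
}
toy_freq = {"ab": 10, "cd": 10, "ef": 1}
selection = select_compact_cover(toy_candidates, toy_freq)
```

Under these assumptions, weighting by rarity pulls in hard-to-cover units early, while common units tend to be covered incidentally by later selections, which is one way a frequency distribution could help keep the selected material compact.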


Cite as: Kassel, R. (1994) Automating the design of compact linguistic corpora. Proc. 3rd International Conference on Spoken Language Processing (ICSLP 1994), 1827-1830

@inproceedings{kassel94_icslp,
  author={Rob Kassel},
  title={{Automating the design of compact linguistic corpora}},
  year=1994,
  booktitle={Proc. 3rd International Conference on Spoken Language Processing (ICSLP 1994)},
  pages={1827--1830}
}