September 22-25, 1997
We statistically investigate various speech corpora to extract useful information reflecting their contents, with the aim of creating guidelines for selecting the most suitable corpus. Words are not separated by spaces in Japanese text. Accordingly, we adopt n-gram counting methods to extract frequent mora sequences instead of words. A mora roughly corresponds to a syllable. By investigating the frequencies of 1- to 10-mora sequences in six existing corpora, we can find distinctions between written and spoken language, as well as keywords and topics of dialogues. This paper shows that simple statistical investigation makes it possible to represent the contents of a corpus to some extent without complicated processing such as morphological analysis.
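The n-gram counting over mora sequences described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the input is already transcribed in kana, and the names `split_morae` and `count_mora_ngrams` are hypothetical. The mora splitting rule is a simplification in which a small kana (such as ゃ, ゅ, ょ) joins the preceding character and every other character counts as one mora.

```python
from collections import Counter

# Small kana that merge with the preceding kana into one mora.
# Assumption: input is hiragana or katakana; this rule is a simplification.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(kana):
    """Split a kana string into a list of morae using the simplified rule."""
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # e.g. "き" + "ょ" becomes the single mora "きょ"
        else:
            morae.append(ch)
    return morae


def count_mora_ngrams(utterances, max_n=10):
    """Return a dict mapping n to a Counter of n-mora sequences, n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for utt in utterances:
        morae = split_morae(utt)
        for n in range(1, max_n + 1):
            # Slide a window of n morae across the utterance.
            for i in range(len(morae) - n + 1):
                counts[n][tuple(morae[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the corpora in the paper.
    counts = count_mora_ngrams(["きょうと", "とうきょう"], max_n=3)
    for n in sorted(counts):
        print(n, counts[n].most_common(3))
```

Sorting each counter by frequency with `most_common` gives the frequent mora sequences for a given length; comparing such lists across corpora is one way the kind of investigation described here could be approached.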
Bibliographic reference. Itahashi, Shuichi / Ueda, Naoko / Yamamoto, Mikio (1997): "Several measures for selecting suitable speech corpora", In EUROSPEECH-1997, 1751-1754.