ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition

April 13-16, 2003
Tokyo Institute of Technology, Tokyo, Japan

Corpus and Text Analysis of Spontaneous Japanese

Hitoshi Isahara

Communications Research Laboratory, Tokyo, Japan

The "Spontaneous Speech: Corpus and Processing Technology" project has three major parts: (1) compilation of a large spontaneous speech corpus, (2) establishment of spoken language engineering based on the corpus, and (3) development of a prototype spoken language summarization system. This paper describes how we help to develop this large corpus, i.e., part (1), using technology developed as part of (2). First, we discuss how to annotate the whole corpus morphologically. Second, we explain how we annotate sentence boundaries. Third, we discuss discourse annotation for the Corpus of Spontaneous Japanese (CSJ). This paper gives an overview of this work; the details are explained in the other papers in this volume.

