Automatic Extraction of Phonetically Rich Sentences from Large Text Corpus of Indian Languages

Karunesh Arora, Sunita Arora, Kapil Verma, Shyam Sunder Agrawal

C-DAC, Ministry of Communications & Information Technology, India

A set of phonetically rich sentences is required to represent the different speech units of a language when developing Automatic Speech Recognition and Speech Synthesis systems. Selecting such a set from a large text corpus without altering the characteristics of the corpus remains a difficult task. A major concern is deciding on what basis sentences should be chosen so that the selected set covers all phonetic aspects of the language under study in the smallest possible size. This paper describes a simple process for automatically extracting such a set of sentences from a large text corpus of a given Indian language, and presents an algorithm for the process. The process is language independent and works for most Indian languages. The extent of success achieved, in terms of the phonetic richness of the extracted sentences, is also discussed.

