In this paper we discuss ongoing work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently stands at approximately 18 hours (approx. 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments with automatic speech recognition systems for these languages.
Cite as: Kumar, R., Singh, S., Ratan, S., Raj, M., Sinha, S., Mishra, S., Lahiri, B., Seshadri, V., Bali, K., Ojha, A.K. (2022) Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi. Proc. 1st Workshop on Speech for Social Good (S4SG), 1-5, doi: 10.21437/S4SG.2022-1
@inproceedings{kumar22_s4sg,
  author={Ritesh Kumar and Siddharth Singh and Shyam Ratan and Mohit Raj and Sonal Sinha and Sumitra Mishra and Bornini Lahiri and Vivek Seshadri and Kalika Bali and Atul Kr. Ojha},
  title={{Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi}},
  year=2022,
  booktitle={Proc. 1st Workshop on Speech for Social Good (S4SG)},
  pages={1--5},
  doi={10.21437/S4SG.2022-1}
}