Online Archive

SCOOT: Databases



Modern speech technology relies on databases (or corpora) to train applications based on machine learning.

Corpus linguistics uses databases as a resource for language studies.


The European Language Resources Association (ELRA) is a non-profit organisation whose main mission is to make Language Resources (LRs) for Human Language Technologies (HLT) available to the community at large.

To achieve this goal, ELRA carries out a wide variety of activities around LRs, including Identification & Distribution, Production & Validation, Technology Evaluation, and Information Dissemination on HLT.


The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories based in the USA. LDC was formed in 1992 to address the critical data shortage then facing language technology research and development.

Corpora can be very expensive, but many of the classic ones are free or relatively cheap, e.g. TIMIT, the Wall Street Journal Corpus and Resource Management.