ASR


A challenge on building an Automatic Speech Recognition (ASR) system for the Telugu language is being organized jointly by IIIT Hyderabad and Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), as a part of the National Language Translation Mission (NLTM). In this challenge, we are releasing a 2000.8-hour Telugu Speech Database collected in a crowdsourced manner. Regional variations of Telugu speech were collected in three modes (spontaneous, conversational, and read) under different background conditions (clean and moderately noisy environments), with transcriptions of varying accuracy due to crowdsourcing.

Financial assistance for this Telugu database collection was provided by the Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of India, under the pilot project on "Crowd Sourced Large Speech Data Sets to Enable Indian Language Speech - Speech Solutions".

Important Dates

Challenge registration open :: September 6th 2022 (09:00 AM IST)
Release of training and development data :: September 7th 2022 (09:00 AM IST)
Release of evaluation data :: ~~September 25th 2022 (09:00 AM IST)~~ October 15th 2022
Submission portal open :: ~~September 26th 2022 (09:00 AM IST)~~ ~~October 17th 2022~~ October 19th 2022 (IST)
Submission portal close :: ~~September 30th 2022 (09:00 AM IST)~~ ~~October 22nd 2022~~ October 24th 2022 (IST)
Update of baseline results :: ~~October 1st 2022 (05:30 PM IST)~~ ~~October 19th 2022~~ October 23rd 2022 (IST)
Leaderboard open :: ~~October 1st 2022 (09:00 AM IST)~~ ~~October 20th 2022~~ October 23rd 2022 (IST)
Leaderboard close :: ~~October 1st 2022 (11:00 PM IST)~~ ~~October 22nd 2022~~ October 24th 2022 (IST)
Result announcements :: ~~October 3rd 2022 (06:00 PM IST)~~ ~~October 24th 2022~~ October 30th 2022 (IST)

Challenge Overview

Recent advances in deep learning have made it possible to build speech-based solutions that are robust to language, speaker, and environmental variations. However, building good-quality speech systems is still a challenging task. One of the fundamental building blocks of speech-based products is the ASR system. Building an ASR system requires a large amount of training data covering different speakers and environments, as well as high-end computational resources. Over the last decade, academia and industry have made many attempts to create corpora for English speech recognition. The performance of a speech recognition system depends mainly on the quality and size of the training corpora. The amount of data available for languages like English is on the order of a hundred thousand hours. As a result, some studies claim that English ASR for a generic domain has achieved human parity (4%).

However, India is a land of linguistic diversity, with around 1500 languages spoken as a medium of communication. Of these, 22 are recognized as scheduled languages by the Government of India. According to a recent census, 30 Indian languages have more than a million speakers. Among those languages, all except Hindi are considered low-resourced. As a part of this challenge, we are releasing a 2000.8-hour Telugu Speech Database collected in a crowdsourced manner. Everyone who participates in this challenge will be free to use this data for research purposes.

Enroll by registering at this link: Registration. The download link will be emailed within 24 hours of registering. If the link is not provided within 24 hours, please reach out to us via speechlabiiit@gmail.com or mirishkar.ganesh@research.iiit.ac.in

Challenge Rules

The data released as a part of this challenge can be used freely for academic purposes but permission for any commercial use of the data should be sought by writing to mirishkar.ganesh@research.iiit.ac.in and anil.vuppala@iiit.ac.in

Participants must use only the released audio data for training.
The metric used for evaluation is Word Error Rate (WER).
A maximum of three submissions is allowed for each track.
Only the audio for the blind test set will be released. Participants are expected to run their systems on the blind test set and submit the ASR hypotheses for evaluation.
Participants will need to share their final ASR model or an API of their model, to reproduce the hypotheses against the blind test set.
If registered participants feel that they cannot submit a system, they must email mirishkar.ganesh@research.iiit.ac.in and anil.vuppala@iiit.ac.in a withdrawal clause stating that they will use the data for research purposes only.
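
The WER metric named in the rules above is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that definition only. The function name `word_error_rate` is illustrative, and the organizers' scoring script and text normalization are not described on this page, so official scores may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For a four-word reference whose hypothesis has one substituted word and one missing word, this returns 0.5. Over a whole test set, WER is usually computed as total errors divided by total reference words, not as an average of per-utterance values.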

There are two types of tracks:

Track - 1 (Closed Challenge)
- Participants can use only the 2000+ hours of training data and the 5+ hours of development data for training models (both acoustic and language models).
  Note: Pretrained models and external data cannot be used.
Track - 2 (Open Challenge)
- Participants can use any external/additional dataset for training models (both acoustic and language models).
  Note: Open to all external data, pretrained models, etc.

Registration

FAQ

Will punctuation marks also be present in the evaluation set?

No, there aren't any punctuation marks in the evaluation set. A thorough text normalization is done over it.

Can we delete the punctuation marks in the train and dev set?

Yes, you can delete the punctuation marks while training your model.
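
As an illustration of the answer above, here is a minimal sketch of stripping punctuation from a transcript line. It assumes that "punctuation" means any character in a Unicode punctuation category and that each mark should become a word boundary. The organizers' actual normalization is not described here, so `strip_punctuation` is a hypothetical helper, not their script.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace Unicode punctuation with spaces, then collapse whitespace."""
    replaced = "".join(
        # Categories starting with "P" cover commas, full stops, quotes,
        # question marks, the danda, and similar marks. Telugu letters and
        # vowel signs fall in other categories and are left untouched.
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(replaced.split())
```

Replacing marks with a space instead of deleting them avoids gluing together two words that were separated only by a comma. When applying this to a transcript file, keep the utterance ID column out of the function, since IDs often contain underscores or hyphens.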

Are there any baseline results for the given data?

Yes, we do have a baseline result, but it will be released on ~~October 1st 2022 (17:30 IST)~~ TBA along with the leaderboard position.

Are there any restrictions on training the model? (Can we augment with noise or add external data to train the model?)

Yes, we have imposed certain rules for building an ASR system.

Participants must use only the released audio data for training, but data augmentation may be applied to it (adding external noise, speed perturbation, volume perturbation). When adding external noise, make sure it contains no speech segments. Whichever data augmentation technique you use must be open-sourced.
Participants need not limit themselves to the text of transcriptions provided in released datasets to build a language model or lexicon if required. They will be required to make those resources publicly available.
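
The three augmentations named above (volume perturbation, speed perturbation, and additive noise) can be sketched in plain Python on a list of float samples. This is a simplified illustration under stated assumptions: the function names are hypothetical, the linear-interpolation resampler is a stand-in for the higher-quality resamplers found in open-source audio toolkits, and the noise signal is assumed to be speech-free, as the rule requires.

```python
import math


def change_volume(samples, gain):
    """Volume perturbation: scale every sample by a constant gain."""
    return [s * gain for s in samples]


def change_speed(samples, factor):
    """Speed perturbation by linear-interpolation resampling.

    factor > 1 shortens the signal (faster), factor < 1 lengthens it.
    Pitch shifts together with tempo in this simple scheme.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for n in range(out_len):
        pos = n * factor              # fractional read position in the input
        i = min(int(pos), last)
        frac = pos - int(pos)
        nxt = samples[min(i + 1, last)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


def add_noise(samples, noise, snr_db):
    """Mix a (speech-free) noise signal into samples at the given SNR in dB."""
    if not samples or not noise:
        raise ValueError("samples and noise must be non-empty")
    # Repeat the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(samples))]
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = sum(x * x for x in tiled) / len(tiled)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Choose scale so that signal_power / (scale^2 * noise_power) = 10^(snr/10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * x for s, x in zip(samples, tiled)]
```

A call such as `add_noise(change_speed(samples, 1.1), noise, 15.0)` would produce a slightly faster copy of an utterance with noise mixed in at 15 dB SNR. The values 1.1 and 15.0 are arbitrary examples, not settings recommended by the organizers.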

Challenge Submission Portal

Submission Link: Submission.

The metric used for evaluation is Word Error Rate (WER) only.
A maximum of three submissions is allowed for each track.
Participants will need to share their final ASR model or an API of their model, to reproduce the hypotheses against the blind test set.
The decode files submitted should be in the standard Kaldi text file format. Partial decodes will not be accepted.
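
In the Kaldi text file format, each line holds an utterance ID followed by the space-separated words for that utterance. The sketch below writes hypotheses in that layout and checks that no expected utterance is missing, since partial decodes are not accepted. The function names and the `utt001`-style IDs are hypothetical, and the list of expected IDs is assumed to come from the released blind test set.

```python
def format_kaldi_text(hypotheses: dict) -> str:
    """Render {utterance_id: hypothesis} as Kaldi-style text, one line each."""
    lines = []
    for utt_id in sorted(hypotheses):  # Kaldi tools generally expect sorted IDs
        if not utt_id or any(ch.isspace() for ch in utt_id):
            raise ValueError(f"invalid utterance id: {utt_id!r}")
        # Collapse runs of whitespace so words are separated by one space.
        words = " ".join(hypotheses[utt_id].split())
        lines.append(f"{utt_id} {words}".rstrip())
    return "\n".join(lines) + "\n"


def missing_utterances(kaldi_text: str, expected_ids) -> set:
    """Return the expected utterance IDs that have no line in kaldi_text."""
    seen = {
        line.split(maxsplit=1)[0]
        for line in kaldi_text.splitlines()
        if line.strip()
    }
    return set(expected_ids) - seen
```

Running `missing_utterances` on the file contents before uploading gives a quick check that the decode is complete. An empty set means every expected utterance has a line.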

Leaderboard

Contact Us

Please reach out to us via speechlabiiit@gmail.com or mirishkar.ganesh@research.iiit.ac.in