Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages
- URL: http://arxiv.org/abs/2206.01205v2
- Date: Tue, 23 May 2023 05:58:35 GMT
- Title: Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages
- Authors: Kavitha Raju, Anjaly V, Ryan Lish, Joel Mathew
- Abstract summary: We release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages.
We set up multiple experimental splits and train and analyze two competitive ASR models to serve as baselines for future research using this data.
- Score: 0.6193838300896449
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic Speech Recognition (ASR) has increasing utility in the modern world. There are many ASR models available for languages with large amounts of training data, like English. However, low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits and train and analyze two competitive ASR models to serve as baselines for future research using this data.
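As a rough illustration of how such experimental splits might be set up, here is a minimal Python sketch. It assumes the audio is indexed by a CSV manifest of (audio_path, transcript) rows; the file name, split ratios, and column layout are illustrative assumptions, not details from the paper.

```python
import csv
import random

def make_splits(manifest_path, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle an ASR manifest (rows of audio_path, transcript) and cut it
    into train/dev/test lists. Ratios, seed, and CSV layout are assumptions."""
    with open(manifest_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    random.Random(seed).shuffle(rows)  # fixed seed keeps splits reproducible
    n_train = int(len(rows) * train_frac)
    n_dev = int(len(rows) * dev_frac)
    return (rows[:n_train],                 # train
            rows[n_train:n_train + n_dev],  # dev
            rows[n_train + n_dev:])         # test (remainder)

# Hypothetical manifest file name, for illustration only.
train_set, dev_set, test_set = make_splits("snow_mountain_manifest.csv")
print(len(train_set), len(dev_set), len(test_set))
```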
Related papers
- Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support [5.926447149127937]
We develop version 3 of the Divide and Remaster (DnR) dataset.
This work addresses issues relating to vocal content in non-dialogue stems, loudness distributions, the mastering process, and linguistic diversity.
Benchmark results using the Bandit model indicate that training on multilingual data significantly improves the model's generalizability.
arXiv Detail & Related papers (2024-07-09T23:39:37Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
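The alignment-and-segmentation step described in the entry above can be sketched as a greedy packer over word-level timestamps. A minimal sketch, assuming forced-alignment output as (word, start, end) triples; the 15-second cap is an illustrative choice, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def segment(words, max_len=15.0):
    """Greedily pack aligned words into segments of at most max_len seconds."""
    segments, current = [], []
    for w in words:
        if current and w.end - current[0].start > max_len:
            segments.append(current)  # close the segment before it overruns
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return [(" ".join(w.text for w in s), s[0].start, s[-1].end)
            for s in segments]

words = [Word("in", 0.0, 0.2), Word("the", 0.2, 0.4), Word("beginning", 0.4, 1.0)]
print(segment(words, max_len=15.0))
```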
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
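The lexicon-driven idea in the entry above can be illustrated with a toy scorer that averages word polarities from a multilingual lexicon. The entries below are invented placeholders; the paper itself pretrains on such lexicons rather than scoring with them directly.

```python
# Hypothetical multilingual sentiment lexicon: word -> polarity in [-1, 1].
LEXICON = {"good": 1.0, "bad": -1.0, "accha": 1.0, "bura": -1.0}

def lexicon_sentiment(tokens):
    """Average polarity of tokens found in the lexicon; 0.0 if none match."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(lexicon_sentiment("the movie was good not bad".split()))  # 0.0: one +1, one -1
```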
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework designed especially for low-resource languages.
Because low-resource languages lack enough paired video-text data for training, developing lip reading models for them is regarded as challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Model Adaptation for ASR in low-resource Indian Languages [28.02064068964355]
Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models like wav2vec2 and large-scale multi-lingual training like Whisper.
A huge challenge still exists for low-resource languages where the availability of both audio and text is limited.
Many adaptation and fine-tuning techniques can be applied to overcome this data scarcity by utilising well-resourced, similar languages.
It could be that an abundance of acoustic data in a language reduces the need for large text-only corpora.
arXiv Detail & Related papers (2023-07-16T05:25:51Z)
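Adapting an SSL acoustic model such as wav2vec2, as discussed in the entry above, typically amounts to CTC fine-tuning. A minimal Hugging Face sketch with a single dummy example; the checkpoint name is illustrative, and real training would iterate over a low-resource dataset.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative English checkpoint; a multilingual one would suit Indian languages better.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.zeros(16000)  # 1 second of dummy 16 kHz audio
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids

model.train()
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()  # one step of the usual fine-tuning loop
print(float(loss))
```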
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining this pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
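The language-model component of a pipeline like the one above can be estimated from text alone, which is what makes an n-gram resource like Crubadan useful. A toy add-one-smoothed bigram scorer; the three-line corpus is a stand-in for real Crubadan data.

```python
import math
from collections import Counter

corpus = ["the dog runs", "the cat runs", "the dog sleeps"]  # stand-in corpus
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = ["<s>"] + line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)  # vocabulary size for add-one smoothing

def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    toks = ["<s>"] + sentence.split()
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(toks, toks[1:]))

# The LM prefers word orders seen in the text-only corpus.
print(log_prob("the dog runs"), log_prob("dog the runs"))
```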
- Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition [0.0]
We present an approach to create labelled data for Maithili, Bhojpuri and Dogri.
All data and models are available in the open domain.
arXiv Detail & Related papers (2022-03-31T06:12:52Z)
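A natural sanity check on pseudo labels like those above is the word error rate against a small hand-verified set. A minimal sketch using the jiwer package; the sentences are invented examples.

```python
import jiwer  # pip install jiwer

references = ["the sun rises in the east", "rivers flow to the sea"]
hypotheses = ["the sun rise in the east", "rivers flow to sea"]

# WER = (substitutions + deletions + insertions) / reference words
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```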
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
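Multi-dataset learning as in the entry above starts by pooling corpora with a shared schema. A sketch using Hugging Face datasets, with two in-memory toy tables standing in for MDCC and Common Voice zh-HK; real use would load both corpora from the hub and align their columns first.

```python
from datasets import Dataset, concatenate_datasets

# Toy stand-ins for MDCC and Common Voice zh-HK with a shared schema.
mdcc = Dataset.from_dict({"path": ["mdcc_001.wav"], "transcript": ["第一句"]})
cv = Dataset.from_dict({"path": ["cv_001.wav"], "transcript": ["第二句"]})

# concatenate_datasets requires identical features across inputs.
combined = concatenate_datasets([mdcc, cv]).shuffle(seed=42)
print(combined)  # Dataset({features: ['path', 'transcript'], num_rows: 2})
```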
- Towards Building ASR Systems for the Next Billion Users [15.867823754118422]
We make contributions towards building ASR systems for low-resource languages of the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages.
arXiv Detail & Related papers (2021-11-06T19:34:33Z)
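Curation at the scale described above starts with simple bookkeeping over the raw audio. A small sketch that totals per-language hours from a manifest; the CSV columns are an assumed layout, not the paper's format.

```python
import csv
from collections import defaultdict

def hours_per_language(manifest_path):
    """Sum audio duration per language from a CSV with columns
    `language` and `duration_seconds` (an assumed layout)."""
    totals = defaultdict(float)
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals[row["language"]] += float(row["duration_seconds"]) / 3600.0
    return dict(totals)

# Returns e.g. {'hindi': ..., 'tamil': ...}; values would sum to the curated total.
```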
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.