Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry
- URL: http://arxiv.org/abs/2506.00145v1
- Date: Fri, 30 May 2025 18:36:54 GMT
- Title: Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry
- Authors: Sujeet Kumar, Pretam Ray, Abhinay Beerukuri, Shrey Kamoji, Manoj Balaji Jagadeeshan, Pawan Goyal
- Abstract summary: We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models.
- Score: 6.415545341980497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of its poetic verses, which are characterized by intricate prosodic and rhythmic patterns. This gap in research raises the question: How can we develop an effective ASR system for Sanskrit, particularly one that captures the nuanced features of its poetic form? In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models. Experimentation revealed that IndicWhisper performed the best among the SOTA models.
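As a rough illustration of the benchmarking setup described in the abstract, the sketch below transcribes one labelled audio sample with a multilingual speech model and scores the hypothesis against its reference transcription using word error rate. The model id, audio path, and reference text are placeholders rather than details from the paper; in practice an IndicWhisper checkpoint would be substituted for the generic Whisper model shown here.

```python
# A minimal sketch of scoring a multilingual ASR model on one labelled sample.
# The model id, wav path, and reference text below are all placeholders.
from transformers import pipeline
import jiwer

# Any multilingual speech model from the Hugging Face Hub can stand in here;
# the paper found IndicWhisper to perform best among the models it compared.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

hypothesis = asr("rigveda_sample.wav")["text"]  # hypothetical audio file
reference = "agnim ile purohitam"               # hypothetical reference text

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```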
Related papers
- Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations [65.59784436914548]
We introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. We convert the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC).
arXiv Detail & Related papers (2025-03-08T16:40:13Z) - LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [45.729377551100676]
We show that a syllabic organization emerges when learning sentence-level representations of speech. We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representations of speech.
arXiv Detail & Related papers (2023-10-16T20:05:36Z) - Aesthetics of Sanskrit Poetry from the Perspective of Computational Linguistics: A Case Study Analysis on Siksastaka [11.950202012146498]
This article explores the intersection of Sanskrit poetry and computational linguistics.
We propose a roadmap of an interpretable framework to analyze and classify the qualities and characteristics of fine Sanskrit poetry.
We provide a deep analysis of Siksastaka, a Sanskrit poem, from the perspective of 6 prominent kavyashastra schools.
arXiv Detail & Related papers (2023-08-14T11:26:25Z) - Svarah: Evaluating English ASR Systems on Indian Accents [12.197514367387692]
We create Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 geographic locations throughout India.
Svarah comprises both read speech and spontaneous conversational data, covering various domains such as history, culture, and tourism, ensuring a diverse vocabulary.
We evaluate 6 open source ASR models and 2 commercial ASR systems on Svarah and show that there is clear scope for improvement on Indian accents.
arXiv Detail & Related papers (2023-05-25T06:20:29Z) - Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili [0.0]
Several indigenous African languages, including Kiswahili, are technologically under-resourced.
This paper explores the transcription process and the development of a Kiswahili speech corpus.
It provides an updated Kiswahili phoneme dictionary for the ASR model that was created using the CMU Sphinx speech recognition toolkit.
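For context, a CMU Sphinx pronunciation dictionary is a plain-text file mapping each word to its phoneme sequence, one entry per line. The Kiswahili entries below are invented for illustration and are not taken from the paper's dictionary.

```
habari  h a b a r i
jambo   j a m b o
asante  a s a n t e
```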
arXiv Detail & Related papers (2022-10-29T09:04:09Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We present the first comprehensive public datasets, named HUST-ART, HUST-AST, ABE, and Tana, for Amharic script detection and recognition in natural scenes.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - A study on native American English speech recognition by Indian listeners with varying word familiarity level [62.14295630922855]
Three kinds of responses are collected from each listener while they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
Analysis by speaker nativity shows that utterances from speakers of some nativities are harder for Indian listeners to recognize than those from other nativities.
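For reference, WER counts the minimum number of word-level substitutions (S), deletions (D), and insertions (I) needed to turn the recognized sentence into the original one, normalized by the number of words N in the original: WER = (S + D + I) / N. For example, a 10-word reference recognized with one substitution and one deletion (and no insertions) gives WER = (1 + 1 + 0) / 10 = 20%.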
arXiv Detail & Related papers (2021-12-08T07:43:38Z) - Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights [25.666767669695044]
We release a 78 hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language.
We propose a new modelling unit, inspired by syllable-level unit selection, that captures character sequences from one vowel in the word to the next vowel.
We extend these insights from Sanskrit ASR for building ASR systems in two other Indic languages, Gujarati and Telugu.
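One plausible reading of the vowel-to-vowel modelling unit above is sketched below on romanized text: consonants are grouped with the vowel that follows them, so each unit ends at a vowel. The vowel inventory and the example word are simplifications for illustration, not the paper's exact scheme.

```python
# A rough sketch (not the paper's exact algorithm) of splitting a romanized
# word into vowel-ending units: each unit collects the consonants seen since
# the previous vowel plus the vowel that closes it.
VOWELS = set("aAiIuUeEoO")  # simplified vowel inventory

def vowel_units(word: str) -> list[str]:
    units, current = [], ""
    for ch in word:
        current += ch
        if ch in VOWELS:   # a vowel closes the current unit
            units.append(current)
            current = ""
    if current:            # trailing consonants attach to the last unit
        if units:
            units[-1] += current
        else:
            units.append(current)
    return units

print(vowel_units("agnimILe"))  # ['a', 'gni', 'mI', 'Le']
```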
arXiv Detail & Related papers (2021-06-02T18:06:32Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
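For orientation, the pretrained Allosaurus recognizer can be run off the shelf as sketched below; this follows the package's documented quickstart, with a placeholder wav path, and does not show the fine-tuning on Luhya data that the paper contributes.

```python
# Minimal use of the pretrained universal phone recognizer from the
# `allosaurus` package; the wav path is a placeholder.
from allosaurus.app import read_recognizer

model = read_recognizer()             # loads the default pretrained model
print(model.recognize("sample.wav"))  # prints a sequence of phones
```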
arXiv Detail & Related papers (2021-04-04T15:07:55Z) - Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.