Accent Placement Models for Rigvedic Sanskrit Text
- URL: http://arxiv.org/abs/2511.23088v1
- Date: Fri, 28 Nov 2025 11:22:17 GMT
- Title: Accent Placement Models for Rigvedic Sanskrit Text
- Authors: Akhil Rajeev P, Annarao Kulkarni,
- Abstract summary: The Rigveda, among the oldest Indian texts, employs a distinctive pitch-accent system.<n>This work develops a parallel corpus of interpretive-unaccented lokas.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system : udātta, anudātta, svarita whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
Related papers
- Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead [4.519836503888727]
We propose a training framework designed to improve feature representation and recognition accuracy without increasing inference costs.<n>ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase.<n> Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-04T13:44:54Z) - LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA [4.104443734934105]
We compare encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/Deep-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline.<n>Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts.
arXiv Detail & Related papers (2025-08-31T07:51:22Z) - Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models [0.0]
We introduce UTR-STCNet, a Transformer-based architecture for flexible and biologically grounded modeling of variable-length 5'UTRs.<n>A Saliency-Aware Token Clustering (SATC) module iteratively aggregates nucleotide tokens into meaningful units based on saliency scores.<n>A Saliency-Guided Transformer (SGT) block then captures both local and distal regulatory dependencies using a lightweight attention mechanism.
arXiv Detail & Related papers (2025-07-22T17:51:13Z) - LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models.<n>We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation.<n>We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z) - BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation [48.20672677492805]
Current EEG/MEG-to-text decoding systems suffer from three key limitations.<n>BrainECHO is a multi-stage framework that employs decoupled representation learning.<n>BrainECHO demonstrates robustness across sentence, session, and subject-independent conditions.
arXiv Detail & Related papers (2024-10-19T04:29:03Z) - Semantically Corrected Amharic Automatic Speech Recognition [27.569469583183423]
We build a set of ASR tools for Amharic, a language spoken by more than 50 million people in eastern Africa.
We release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress.
We introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence.
arXiv Detail & Related papers (2024-04-20T12:08:00Z) - Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC)
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - CodeBLEU: a Method for Automatic Evaluation of Code Synthesis [57.87741831987889]
In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy.
We introduce a new automatic evaluation metric, dubbed CodeBLEU.
It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.
arXiv Detail & Related papers (2020-09-22T03:10:49Z) - Robust Prediction of Punctuation and Truecasing for Medical ASR [18.08508027663331]
This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
arXiv Detail & Related papers (2020-07-04T07:15:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.