Related papers: BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali

URL: http://arxiv.org/abs/2601.01778v1
Date: Mon, 05 Jan 2026 04:17:31 GMT
Title: BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali
Authors: Jakir Hasan, Shrestha Datta, Md Saiful Islam, Shubhashis Roy Dipta, Ameya Debnath,
Abstract summary: We propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects.
Score: 1.1347335625859423
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Despite its widespread use, Bengali lacks a robust automated International Phonetic Alphabet (IPA) transcription system that effectively supports both standard language and regional dialectal texts. Existing approaches struggle to handle regional variations and numerical expressions, and they generalize poorly to previously unseen words. To address these limitations, we propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects. BanglaIPA improves inference efficiency by leveraging a precomputed word-to-IPA mapping dictionary for previously observed words. The system is evaluated on standard Bengali and six regional variations from the DUAL-IPA dataset. Experimental results show that BanglaIPA outperforms baseline IPA transcription models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%, highlighting its robustness in phonetic transcription generation for the Bengali language.
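The dictionary lookup and the word error rate mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `fallback` callable standing in for the character-level model, and the use of plain word-level edit distance for the error rate are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def transcribe(text: str, cache: Dict[str, str],
               fallback: Callable[[str], str]) -> str:
    """Transcribe whitespace-separated words, reusing cached IPA when present.

    `cache` plays the role of a precomputed word-to-IPA dictionary and
    `fallback` is a hypothetical stand-in for a model that handles
    previously unseen words.
    """
    out: List[str] = []
    for word in text.split():
        if word not in cache:
            # Only unseen words trigger the (presumably expensive) model call.
            cache[word] = fallback(word)
        out.append(cache[word])
    return " ".join(out)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

With this layout, repeated words cost a single model call, which is one plausible reading of the efficiency gain the abstract describes. The paper may define its word error rate differently from the standard edit-distance form used here.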

Related papers

Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z)
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE [0.0]
BengaliBPE is a language-aware subword tokenizer for the Bengali script. It applies Unicode normalization and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. It provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost.
arXiv Detail & Related papers (2025-11-07T15:23:32Z)
Bilingual Word Level Language Identification for Omotic Languages [44.04646981451376]
This paper presents Bilingual Language Identification (BLID) for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. To overcome this challenge, we employed various experiments on various approaches. The combination of the BERT based pretrained language model and LSTM approach performed better, with an F1 score of 0.72 on the test set.
arXiv Detail & Related papers (2025-09-05T23:36:26Z)
PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs [51.745816131869674]
Large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner's first language (L1) to aid in acquiring L2 vocabulary. We present PhoniTale, a novel cross-lingual mnemonic generation system that performs IPA-based phonological adaptation and syllable-aware alignment to retrieve L1 keyword sequence. Our findings show that PhoniTale consistently outperforms previous automated approaches and achieves quality comparable to human-written mnemonics.
arXiv Detail & Related papers (2025-07-07T19:50:12Z)
Improving Informally Romanized Language Identification [42.001850291770445]
Romanization renders languages that are normally easily distinguished due to being written in different scripts highly confusable. We increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set.
arXiv Detail & Related papers (2025-04-30T11:36:28Z)
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules. COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
IPA Transcription of Bengali Texts [0.2113150621171959]
The International Phonetic Alphabet (IPA) serves to systematize phonemes in language. In Bengali phonology and phonetics, ongoing scholarly deliberations persist concerning the IPA standard and core Bengali phonemes. This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standard.
arXiv Detail & Related papers (2024-03-29T09:33:34Z)
Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens [0.0]
This paper introduces the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The DGT technique is applied to fine-tune several transformer-based models, on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model achieving superior performance over word-based models.
arXiv Detail & Related papers (2024-03-26T05:55:21Z)
MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment [0.0]
The International Phonetic Alphabet (IPA) is indispensable in language learning and understanding. Bangla being the 7th most widely used language gives rise to the need for IPA in its domain. In this study, we have utilized a transformer-based sequence-to-sequence model at the letter and symbol level to get the IPA of each Bangla word.
arXiv Detail & Related papers (2023-11-07T08:20:06Z)
Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation. To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda. We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language. Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
AlloVera: A Multilingual Allophone Database [137.3686036294502]
AlloVera provides mappings from 218 allophones to phonemes for 14 languages. We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task.
arXiv Detail & Related papers (2020-04-17T02:02:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.