Study of Indian English Pronunciation Variabilities relative to Received
Pronunciation
- URL: http://arxiv.org/abs/2204.06502v1
- Date: Wed, 13 Apr 2022 16:35:52 GMT
- Title: Study of Indian English Pronunciation Variabilities relative to Received
Pronunciation
- Authors: Priyanshi Pal, Shelly Jain, Anil Vuppala, Chiranjeevi Yarra, Prasanta
Ghosh
- Abstract summary: We consider a corpus, IndicTIMIT, which is rich in the diversity of IE varieties and is curated in a nativity balanced manner.
We present an approach to validate the phonetic rules of IE along with reporting unexplored rules derived using a data-driven manner.
- Score: 5.3956335232250385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In contrast to British or American English, labeled pronunciation data on the
phonetic level is scarce for Indian English (IE). This has made it challenging
to study pronunciations of Indian English. Moreover, IE has many varieties,
resulting from various native language influences on L2 English. Indian English
has been studied in the past, by a few linguistic works. They report phonetic
rules for such characterisation, however, the extent to which they can be
applied to a diverse large-scale Indian pronunciation data remains
under-examined. We consider a corpus, IndicTIMIT, which is rich in the
diversity of IE varieties and is curated in a nativity balanced manner. It
contains data from 80 speakers corresponding to various regions of India. We
present an approach to validate the phonetic rules of IE along with reporting
unexplored rules derived using a data-driven manner, on this corpus. We also
provide quantitative information regarding which rules are more prominently
observed than the others, attributing to their relevance in IE accordingly.
Related papers
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z) - Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties [0.0]
We aim to address the issue of bias at its root - the data itself.
We curate a dataset of tweets from countries with high proportions of underserved English variety speakers.
Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries.
arXiv Detail & Related papers (2024-01-21T13:18:20Z) - Mukhyansh: A Headline Generation Dataset for Indic Languages [4.583536403673757]
Mukhyansh is an extensive multilingual dataset, tailored for Indian language headline generation.
Comprising over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages.
Mukhyansh outperforms all other models, achieving an average ROUGE-L score of 31.43 across all 8 languages.
arXiv Detail & Related papers (2023-11-29T15:49:24Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - A Deep Dive into the Disparity of Word Error Rates Across Thousands of
NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours consisting of $sim9.8$K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - An Investigation of Indian Native Language Phonemic Influences on L2
English Pronunciations [5.3956335232250385]
Growing number of L2 English speakers in India reinforces need to study accents and L1-L2 interactions.
We investigate the accents of Indian English (IE) speakers and report in detail our observations, both specific and common to all regions.
We demonstrate the influence of 18 Indian languages on IE by comparing the native language pronunciations with IE pronunciations obtained jointly from existing literature studies and phonetically annotated speech of 80 speakers.
arXiv Detail & Related papers (2022-12-19T07:41:39Z) - Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - A study on native American English speech recognition by Indian
listeners with varying word familiarity level [62.14295630922855]
We have three kinds of responses from each listener while they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
Speaker nativity wise analysis shows that utterances from speakers of some nativity are more difficult to recognize by Indian listeners compared to few other nativities.
arXiv Detail & Related papers (2021-12-08T07:43:38Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.