Codeswitched Sentence Creation using Dependency Parsing
- URL: http://arxiv.org/abs/2012.02990v1
- Date: Sat, 5 Dec 2020 10:00:06 GMT
- Title: Codeswitched Sentence Creation using Dependency Parsing
- Authors: Dhruval Jain, Arun D Prabhu, Shubham Vatsal, Gopi Ramena, Naresh Purre
- Abstract summary: Codeswitching has become one of the most common occurrences across multilingual speakers of the world.
We present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Codeswitching has become one of the most common occurrences among
multilingual speakers of the world, especially in countries like India, which
has around 23 official languages and roughly 300 million bilingual speakers.
The scarcity of Codeswitched data becomes a
bottleneck in the exploration of this domain with respect to various Natural
Language Processing (NLP) tasks. We thus present a novel algorithm which
harnesses the syntactic structure of English grammar to develop grammatically
sensible Codeswitched versions of English-Hindi, English-Marathi and
English-Kannada data. Apart from maintaining the grammatical sanity to a great
extent, our methodology also guarantees abundant generation of data from a
minuscule snapshot of given data. We use multiple datasets to showcase the
capabilities of our algorithm while at the same time we assess the quality of
generated Codeswitched data using some qualitative metrics along with providing
baseline results for a couple of NLP tasks.
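The core idea in the abstract, using a dependency parse of the English matrix sentence to pick phrase boundaries at which embedded-language words can be substituted, can be sketched in a few lines. The snippet below is only an illustrative approximation and not the paper's algorithm: it assumes spaCy's `en_core_web_sm` parser and a hypothetical `translate_phrase` lookup standing in for an English-Hindi dictionary or translation model.

```python
# Illustrative sketch (not the paper's implementation): generate a
# code-switched variant of an English sentence by dependency-parsing it
# and replacing subject/object noun-phrase subtrees with embedded-language
# translations.
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English dependency parser


def translate_phrase(phrase: str) -> str:
    """Hypothetical placeholder: in practice this would be a bilingual
    dictionary or MT lookup returning a Hindi/Marathi/Kannada phrase."""
    toy_lexicon = {"the boy": "ladka", "an apple": "ek seb"}
    return toy_lexicon.get(phrase.lower(), phrase)


def codeswitch(sentence: str) -> str:
    doc = nlp(sentence)
    # Collect the token spans of noun-phrase subtrees headed by subjects and
    # objects, which are natural switch points in a matrix-embedded setting.
    spans = {}
    for token in doc:
        if token.dep_ in {"nsubj", "dobj", "pobj"}:
            subtree = sorted(token.subtree, key=lambda t: t.i)
            start, end = subtree[0].i, subtree[-1].i
            spans[start] = (end, translate_phrase(doc[start:end + 1].text))
    # Rebuild the sentence, substituting each marked span with its translation.
    out, i = [], 0
    while i < len(doc):
        if i in spans:
            end, phrase = spans[i]
            out.append(phrase)
            i = end + 1
        else:
            out.append(doc[i].text)
            i += 1
    return " ".join(out)


print(codeswitch("The boy ate an apple"))  # e.g. "ladka ate ek seb"
```

Because each seed sentence can yield several such variants by switching different subsets of constituents, a small monolingual snapshot can be expanded into a much larger code-switched set, which is the abundant-generation property the abstract highlights.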
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of speech translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot [1.3741556944830366]
This study prompted GPT 3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences.
The quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate.
We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT.
arXiv Detail & Related papers (2024-04-26T07:44:44Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which multilingual speakers, in casual settings, mix words from different languages in one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
- Adversarial synthesis based data-augmentation for code-switched spoken language identification [0.0]
Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR).
This study focuses on Indic language code-mixed with English.
A Generative Adversarial Network (GAN) based data augmentation technique is applied to Mel spectrograms of the audio data.
arXiv Detail & Related papers (2022-05-30T06:41:13Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text [14.251949110756078]
We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
arXiv Detail & Related papers (2021-07-14T04:46:39Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)