Codeswitched Sentence Creation using Dependency Parsing
- URL: http://arxiv.org/abs/2012.02990v1
- Date: Sat, 5 Dec 2020 10:00:06 GMT
- Title: Codeswitched Sentence Creation using Dependency Parsing
- Authors: Dhruval Jain, Arun D Prabhu, Shubham Vatsal, Gopi Ramena, Naresh Purre
- Abstract summary: Codeswitching has become one of the most common occurrences across multilingual speakers of the world.
We present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Codeswitching has become one of the most common occurrences among
multilingual speakers of the world, especially in countries like India, which
has around 23 official languages and roughly 300 million bilingual speakers.
The scarcity of Codeswitched data becomes a
bottleneck in the exploration of this domain with respect to various Natural
Language Processing (NLP) tasks. We thus present a novel algorithm which
harnesses the syntactic structure of English grammar to develop grammatically
sensible Codeswitched versions of English-Hindi, English-Marathi and
English-Kannada data. Apart from maintaining the grammatical sanity to a great
extent, our methodology also guarantees abundant generation of data from a
minuscule snapshot of given data. We use multiple datasets to showcase the
capabilities of our algorithm while at the same time we assess the quality of
generated Codeswitched data using some qualitative metrics along with providing
baseline results for a couple of NLP tasks.
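The core idea in the abstract, using a dependency parse of the English matrix sentence to pick phrase boundaries at which embedded-language words can be substituted, can be sketched in a few lines. The snippet below is only an illustrative approximation and not the paper's algorithm: it assumes spaCy's `en_core_web_sm` parser and a hypothetical `translate_phrase` lookup standing in for an English-Hindi dictionary or translation model.

```python
# Illustrative sketch (not the paper's implementation): generate a
# code-switched variant of an English sentence by dependency-parsing it
# and replacing subject/object noun-phrase subtrees with embedded-language
# translations.
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English dependency parser


def translate_phrase(phrase: str) -> str:
    """Hypothetical placeholder: in practice this would be a bilingual
    dictionary or MT lookup returning a Hindi/Marathi/Kannada phrase."""
    toy_lexicon = {"the boy": "ladka", "an apple": "ek seb"}
    return toy_lexicon.get(phrase.lower(), phrase)


def codeswitch(sentence: str) -> str:
    doc = nlp(sentence)
    # Collect the token spans of noun-phrase subtrees headed by subjects and
    # objects, which are natural switch points in a matrix-embedded setting.
    spans = {}
    for token in doc:
        if token.dep_ in {"nsubj", "dobj", "pobj"}:
            subtree = sorted(token.subtree, key=lambda t: t.i)
            start, end = subtree[0].i, subtree[-1].i
            spans[start] = (end, translate_phrase(doc[start:end + 1].text))
    # Rebuild the sentence, substituting each marked span with its translation.
    out, i = [], 0
    while i < len(doc):
        if i in spans:
            end, phrase = spans[i]
            out.append(phrase)
            i = end + 1
        else:
            out.append(doc[i].text)
            i += 1
    return " ".join(out)


print(codeswitch("The boy ate an apple"))  # e.g. "ladka ate ek seb"
```

Because each seed sentence can yield several such variants by switching different subsets of constituents, a small monolingual snapshot can be expanded into a much larger code-switched set, which is the abundant-generation property the abstract highlights.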
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of speech translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot [1.3741556944830366]
This study prompted GPT 3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences.
The quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate.
We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT.
arXiv Detail & Related papers (2024-04-26T07:44:44Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which multilingual speakers, in casual settings, mix words from different languages in one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
- Adversarial synthesis based data-augmentation for code-switched spoken language identification [0.0]
Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR).
This study focuses on Indic language code-mixed with English.
A Generative Adversarial Network (GAN) based data augmentation technique is applied to Mel spectrograms of the audio data.
arXiv Detail & Related papers (2022-05-30T06:41:13Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text [14.251949110756078]
We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
arXiv Detail & Related papers (2021-07-14T04:46:39Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)