From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text
- URL: http://arxiv.org/abs/2107.06483v1
- Date: Wed, 14 Jul 2021 04:46:39 GMT
- Title: From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text
- Authors: Ishan Tarunesh, Syamantak Kumar, Preethi Jyothi
- Abstract summary: We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
- Score: 14.251949110756078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating code-switched text is a problem of growing interest, especially
given the scarcity of corpora containing large volumes of real code-switched
text. In this work, we adapt a state-of-the-art neural machine translation
model to generate Hindi-English code-switched sentences starting from
monolingual Hindi sentences. We outline a carefully designed curriculum of
pretraining steps, including the use of synthetic code-switched text, that
enable the model to generate high-quality code-switched text. Using text
generated from our model as data augmentation, we show significant reductions
in perplexity on a language modeling task, compared to using text from other
generative models of CS text. We also show improvements using our text for a
downstream code-switched natural language inference task. Our generated text is
further subjected to a rigorous evaluation using a human evaluation study and a
range of objective metrics, where we show performance comparable (and sometimes
even superior) to code-switched text obtained via crowd workers who are native
Hindi speakers.
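As a toy illustration of the augmentation loop described above, the sketch below trains an add-one-smoothed bigram LM (standing in for the paper's neural LM) with and without generated CS text and compares held-out perplexity; all sentences are invented.
```python
import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])            # contexts for the denominators
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(sentences, unigrams, bigrams):
    vocab = len(unigrams) + 1
    nll, n = 0.0, 0
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)  # add-one smoothing
            nll -= math.log(p)
            n += 1
    return math.exp(nll / n)

real_cs = ["mujhe yeh movie bahut pasand aayi"]   # invented "real" CS data
generated_cs = ["yeh film bahut acchi thi"]       # invented model-generated CS data
held_out = ["mujhe yeh film pasand aayi"]

baseline = perplexity(held_out, *train_bigram(real_cs))
augmented = perplexity(held_out, *train_bigram(real_cs + generated_cs))
print(baseline, augmented)   # compare perplexity with and without augmentation
```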
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present CoSTA, a new end-to-end model architecture that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
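The title's "aligned speech-text interleaving" suggests alternating aligned speech and text representations along the sequence axis before translation. A rough, hypothetical tensor-level sketch of that operation (shapes invented; not the paper's actual architecture):
```python
import torch

speech_seg = torch.randn(4, 256)  # 4 aligned speech-segment representations
text_seg = torch.randn(4, 256)    # the 4 corresponding ASR-text representations

# Interleave along the sequence axis: s0, t0, s1, t1, ...
interleaved = torch.stack([speech_seg, text_seg], dim=1).reshape(8, 256)
```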
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
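At the core of any VQ-VAE variant is a nearest-codebook lookup; the dynamic encoding-length mechanism itself is not reproduced here. A toy version of that single quantization step, with invented shapes:
```python
import torch

codebook = torch.randn(16, 8)   # 16 learned codes of dimension 8
z = torch.randn(5, 8)           # 5 encoder outputs to quantize

dist = torch.cdist(z, codebook)  # (5, 16) pairwise distances
codes = dist.argmin(dim=1)       # nearest code index per vector
z_q = codebook[codes]            # quantized latents fed to the decoder
```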
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text [1.9185059111021852]
We investigate how pre-trained language models handle code-switched text along three dimensions.
Our findings reveal that pre-trained language models are effective in generalising to code-switched text.
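Probing studies of this kind typically freeze the pretrained model and fit a lightweight classifier on its representations. A minimal sketch under that assumption (the checkpoint and two-sentence dataset are placeholders, not the paper's setup):
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
lm = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    batch = tok(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**batch).last_hidden_state  # (batch, tokens, hidden)
    return hidden.mean(dim=1).numpy()           # crude mean-pooled sentence vectors

sents = ["yeh movie amazing thi", "this film was great"]  # CS vs. monolingual
labels = [1, 0]                                           # 1 = code-switched
probe = LogisticRegression().fit(embed(sents), labels)    # the linear probe
```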
arXiv Detail & Related papers (2024-03-07T19:46:03Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of a streaming model widely used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that injecting the generated code-switching text significantly boosts performance.
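One simple family of CS text generators, shown here only as a stand-in and not necessarily the paper's strategy, replaces selected source words with translations from a bilingual lexicon:
```python
import random

lexicon = {"会议": "meeting", "周末": "weekend", "项目": "project"}  # toy lexicon

def code_switch(tokens, p=0.5, seed=1):
    """Replace lexicon words with their English translation with probability p."""
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < p else t
            for t in tokens]

print(code_switch(["我们", "周末", "讨论", "这个", "项目"]))  # pre-segmented Mandarin
```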
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Improving Code-switching Language Modeling with Artificially Generated Texts using Cycle-consistent Adversarial Networks [41.88097793717185]
We investigate methods for augmenting code-switching training data with artificially generated text.
We propose a framework based on cycle-consistent adversarial networks to transfer monolingual text into code-switching text.
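The core of a cycle-consistent framework is the reconstruction constraint: mapping monolingual text to code-switching text and back should recover the input. A schematic of just that loss term, with tiny linear maps standing in for the real generators and sentence representations in place of raw text:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Linear(16, 16)      # monolingual -> code-switched (representation space)
F_inv = nn.Linear(16, 16)  # code-switched -> monolingual
x = torch.randn(4, 16)     # batch of monolingual sentence representations

cycle_loss = F.l1_loss(F_inv(G(x)), x)  # x -> G(x) -> F_inv(G(x)) should recover x
cycle_loss.backward()
```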
arXiv Detail & Related papers (2021-12-12T21:27:32Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present HinGE, a corpus for the widely used code-mixed language Hinglish (a mix of Hindi and English).
HinGE contains Hinglish sentences generated both by humans and by two rule-based algorithms from parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
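The metric-inefficacy point is easy to reproduce: two acceptable Hinglish renderings of the same content can share almost no n-grams, so a reference-based score like BLEU collapses. A small sacrebleu illustration with invented sentences:
```python
import sacrebleu

# Two plausible Hinglish renderings of the same Hindi-English content:
reference = [["mujhe yeh movie bahut pasand aayi"]]
candidate = ["maine is film ko bahut enjoy kiya"]

print(sacrebleu.corpus_bleu(candidate, reference).score)  # near zero despite validity
```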
arXiv Detail & Related papers (2021-07-08T11:11:37Z)
- Sentiment Analysis of Persian-English Code-mixed Texts [0.0]
Due to the unstructured nature of social media data, we are observing more instances of multilingual and code-mixed texts.
In this study, we collect and label a dataset of Persian-English code-mixed tweets.
We introduce a model that uses pretrained BERT embeddings as well as translation models to automatically learn the polarity scores of these tweets.
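One rough reading of "BERT embeddings plus translation models" is a two-stage baseline: translate the tweet into English, then score polarity with an off-the-shelf classifier. The sketch below assumes that reading; the checkpoints are common public ones, not the paper's:
```python
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-fa-en")
sentiment = pipeline("sentiment-analysis")  # default English sentiment model

tweet = "این فیلم was absolutely great"             # invented code-mixed tweet
english = translate(tweet)[0]["translation_text"]   # Persian portion rendered in English
print(sentiment(english)[0])                        # e.g. {'label': 'POSITIVE', ...}
```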
arXiv Detail & Related papers (2021-02-25T06:05:59Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language within a single end-to-end model.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
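A common way to bridge the modality gap is an adaptation layer that projects acoustic states into the text-embedding space and trains them to match paired text representations. A schematic of that idea (dimensions and the MSE objective are assumptions, not the paper's exact design):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

adapter = nn.Linear(80, 512)             # acoustic dim -> text-embedding dim
speech_states = torch.randn(4, 100, 80)  # (batch, frames, acoustic features)
text_repr = torch.randn(4, 512)          # paired text-side representations

pooled = adapter(speech_states).mean(dim=1)  # speech mapped into text space
gap_loss = F.mse_loss(pooled, text_repr)     # pull the two modalities together
```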
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding [80.3811072650087]
We study natural language watermarking as a defense to help better mark and trace the provenance of text.
We introduce the Adversarial Watermarking Transformer (AWT) with a jointly trained encoder-decoder and adversarial training.
AWT is the first end-to-end model to hide data in text by automatically learning, without ground truth, word substitutions along with their locations.
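For intuition only, here is a hand-crafted word-substitution watermark, far simpler than AWT, which instead learns the substitutions and their locations end-to-end; each listed word pair encodes one bit:
```python
pairs = {"big": "large", "quick": "fast"}  # bit 1 -> substitute, bit 0 -> keep

def embed_bits(text, bits):
    out, i = [], 0
    for word in text.split():
        if word in pairs and i < len(bits):
            out.append(pairs[word] if bits[i] else word)
            i += 1
        else:
            out.append(word)
    return " ".join(out)

print(embed_bits("a quick test of a big idea", [1, 0]))
# -> "a fast test of a big idea" (encodes the bit string 10)
```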
arXiv Detail & Related papers (2020-09-07T11:01:24Z)
- Text Data Augmentation: Towards better detection of spear-phishing emails [1.6556358263455926]
We propose a corpus and task augmentation framework to augment English texts within our company.
Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics.
We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora.
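Of the methods named, back-translation is the most mechanical: translate to a pivot language and back to obtain a paraphrase. A minimal sketch; the Helsinki-NLP checkpoints and the example sentence are illustrative choices, not the paper's:
```python
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text):
    french = en_fr(text)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]  # paraphrased English

print(back_translate("Please verify your account details immediately."))
```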
arXiv Detail & Related papers (2020-07-04T07:45:04Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
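Code-text matching ultimately reduces to scoring a query representation against candidate code representations. A bare-bones version of that final retrieval step, with random vectors standing in for the paper's multi-perspective encoders:
```python
import torch
import torch.nn.functional as F

code_vecs = torch.randn(3, 128)   # encoded candidate code snippets
query_vec = torch.randn(1, 128)   # encoded natural-language query

scores = F.cosine_similarity(query_vec, code_vecs)  # one score per snippet
print(scores.argmax().item())                       # index of best-matching snippet
```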
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.