VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation
- URL: http://arxiv.org/abs/2505.24472v1
- Date: Fri, 30 May 2025 11:18:10 GMT
- Title: VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation
- Authors: Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem, Quang-Nhan Nguyen, Wei Ai, Marine Carpuat
- Abstract summary: Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline.
- Score: 13.047103277038175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows that our naturalistic and complementary synthetic data boost models' performance, reaching translation quality estimation scores of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating these results with LLM-based assessments, we find that augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
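The abstract does not specify how the pipeline's filters are implemented, but its generate-then-filter loop can be sketched. In the toy code below, the `syntactic_plausibility` heuristic, the threshold value, and the example sentences are all illustrative assumptions, not the paper's actual filtering models:

```python
# Hypothetical sketch of a filter step in a synthetic code-mixed data
# pipeline: generate candidate sentences, score each with a plausibility
# function, and keep only those that clear a threshold. A real system
# would replace the toy heuristic with learned syntactic and pragmatic
# filters (e.g., quality-estimation models).

def syntactic_plausibility(sentence: str) -> float:
    """Toy proxy: penalize long consecutive runs of switched (English)
    tokens, approximating the intuition that natural code-mixing embeds
    short foreign spans rather than whole clauses."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    is_english = [t.isascii() and t.isalpha() for t in tokens]
    longest_run = run = 0
    for flag in is_english:
        run = run + 1 if flag else 0
        longest_run = max(longest_run, run)
    return 1.0 - longest_run / len(tokens)

def filter_candidates(candidates, threshold=0.5):
    """Keep candidates whose plausibility score clears the threshold."""
    return [c for c in candidates if syntactic_plausibility(c) >= threshold]

if __name__ == "__main__":
    candidates = [
        "Mình rất thích cái design này",        # short embedded English span
        "I really like this design rất nhiều",  # mostly-English run
    ]
    print(filter_candidates(candidates))
```

In an iterative setup, the surviving candidates would be added to the training pool, a model retrained, and new candidates generated from it, repeating the loop until quality-estimation scores plateau.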
Related papers
- GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning [0.0]
Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps prior to generating a final answer. We present GreenMind-Medium-14B-R1, a Vietnamese reasoning model inspired by a fine-tuning strategy based on Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-04-23T15:48:55Z) - Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content [0.0]
This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
arXiv Detail & Related papers (2025-04-14T20:01:34Z) - FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages [2.377892000761193]
This paper presents the winning submission of the RaaVa team to the Americas 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation. We introduce the Feature-Union Scorer (FUSE) for evaluation; FUSE integrates Ridge regression and Gradient Boosting to model translation quality. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments.
arXiv Detail & Related papers (2025-03-28T06:58:55Z) - Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT. This study systematically investigates how each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, affects translation performance. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
arXiv Detail & Related papers (2025-02-17T14:53:49Z) - Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation [34.57825234659946]
We tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation.
We propose RCMT, a robust perturbation based joint-training model that learns to handle noise in the real-world code-mixed text.
Our evaluation and comprehensive analyses demonstrate the superiority of RCMT over state-of-the-art code-mixed and robust translation methods.
arXiv Detail & Related papers (2024-03-25T13:50:11Z) - Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation [0.0]
We show that synthetic training samples with non-fluent target sentences can improve translation performance.
This improvement is independent of the size of the original training corpus.
arXiv Detail & Related papers (2024-01-29T11:52:45Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Categorizing Semantic Representations for Neural Machine Translation [53.88794787958174]
We introduce categorization to the source contextualized representations.
The main idea is to enhance generalization by reducing sparsity and overfitting.
Experiments on a dedicated MT dataset show that our method reduces compositional generalization error rates by 24%.
arXiv Detail & Related papers (2022-10-13T04:07:08Z) - Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English→German and English→Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z) - Sequence-Level Mixed Sample Data Augmentation [119.94667752029143]
This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems.
Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set.
arXiv Detail & Related papers (2020-11-18T02:18:04Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer with an improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.