Data-Driven Approach for Formality-Sensitive Machine Translation:
Language-Specific Handling and Synthetic Data Generation
- URL: http://arxiv.org/abs/2306.14514v2
- Date: Tue, 27 Jun 2023 06:59:28 GMT
- Authors: Seugnjun Lee, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
- Abstract summary: We introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages.
Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering.
- Score: 5.536220901048185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a data-driven approach for Formality-Sensitive
Machine Translation (FSMT) that caters to the unique linguistic properties of
four target languages. Our methodology centers on two core strategies: 1)
language-specific data handling, and 2) synthetic data generation using
large-scale language models and empirical prompt engineering. This approach
demonstrates a considerable improvement over the baseline, highlighting the
effectiveness of data-centric techniques. Our prompt engineering strategy
further improves performance by producing superior synthetic translation
examples.
Related papers
- Enhancing SLM via ChatGPT and Dataset Augmentation [0.3844771221441211]
We employ knowledge distillation-based techniques and synthetic dataset augmentation to bridge the performance gap between large language models (LLMs) and small language models (SLMs).
Our methods involve two forms of rationale generation--information extraction and informed reasoning--to enrich the ANLI dataset.
Our findings reveal that the incorporation of synthetic rationales significantly improves the model's ability to comprehend natural language, leading to 1.3% and 2.3% higher classification accuracy, respectively, on the ANLI dataset.
arXiv Detail & Related papers (2024-09-19T09:24:36Z)
- Instruction Data Generation and Unsupervised Adaptation for Speech Language Models [21.56355461403427]
We propose three methods for generating synthetic samples to train and evaluate multimodal large language models.
Synthetic data generation emerges as a crucial strategy to enhance the performance of such systems.
We highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions.
arXiv Detail & Related papers (2024-06-18T08:27:00Z)
- Curating Grounded Synthetic Data with Global Perspectives for Equitable AI [0.5120567378386615]
We introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification.
We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations.
Preliminary results demonstrate substantial improvements in performance on traditional NER benchmarks, by up to 7.3%.
arXiv Detail & Related papers (2024-06-10T17:59:11Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages [31.18983138590214]
We propose strategies to synthesize parallel data relying on morpho-syntactic information and using bilingual lexicons.
Our methodology adheres to a realistic scenario backed by the small parallel seed data.
It is linguistically informed, as it aims to create augmented data that is more likely to be grammatically correct.
arXiv Detail & Related papers (2024-02-02T22:25:44Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data [12.638781962950805]
We introduce a framework to evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data.
We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset.
arXiv Detail & Related papers (2022-05-12T20:27:01Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT significantly improves on SOTA neural machine translation performance across 15 translation tasks on 8 language pairs.
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.