On the Economics of Multilingual Few-shot Learning: Modeling the
Cost-Performance Trade-offs of Machine Translated and Manual Data
- URL: http://arxiv.org/abs/2205.06350v1
- Date: Thu, 12 May 2022 20:27:01 GMT
- Title: On the Economics of Multilingual Few-shot Learning: Modeling the
Cost-Performance Trade-offs of Machine Translated and Manual Data
- Authors: Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat
- Abstract summary: We introduce a framework to evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data.
We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset.
- Score: 12.638781962950805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Borrowing ideas from production functions in micro-economics, in
this paper we introduce a framework to systematically evaluate the performance
and cost trade-offs between machine-translated and manually-created labelled
data for task-specific fine-tuning of massively multilingual language models.
We illustrate the effectiveness of our framework through a case study on the
TyDIQA-GoldP dataset. One of the interesting conclusions of the study is that
whenever the cost of machine translation is greater than zero, the optimal
performance at the least cost is always achieved with at least some, and in
some cases only, manually-created data. To our knowledge, this is the first
attempt to extend the concept of production functions to the study of data
collection strategies for training multilingual models, and it can serve as a
valuable tool for other, similar cost-vs-data trade-offs in NLP.
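To make the production-function framing concrete, the following is a minimal
sketch, not the authors' implementation: it assumes a Cobb-Douglas-style
functional form with toy exponents and illustrative per-example costs for
machine-translated versus manually-created data, then grid-searches for the
least-cost mix of the two that reaches a target score. Every name and number
in it is an assumption made for illustration only.

```python
# Illustrative sketch of the production-function framing (assumed form, not
# the paper's fitted model): performance is a Cobb-Douglas-style function of
# two data "inputs", and we search for the cheapest mix hitting a target.
import numpy as np


def performance(n_mt: float, n_manual: float,
                a: float = 0.02, alpha: float = 0.3, beta: float = 0.6) -> float:
    """Toy production function: score produced by MT and manual examples."""
    return a * (n_mt ** alpha) * (n_manual ** beta)


def cost(n_mt: float, n_manual: float,
         c_mt: float = 0.05, c_manual: float = 1.0) -> float:
    """Total data-collection cost at assumed per-example prices."""
    return c_mt * n_mt + c_manual * n_manual


def cheapest_mix(target_score: float, grid: np.ndarray):
    """Brute-force grid search for the least-cost mix reaching the target."""
    best = None
    for n_mt in grid:
        for n_manual in grid:
            if performance(n_mt, n_manual) >= target_score:
                c = cost(n_mt, n_manual)
                if best is None or c < best[0]:
                    best = (c, n_mt, n_manual)
    return best


if __name__ == "__main__":
    sizes = np.linspace(0, 20000, 101)  # candidate example counts per source
    result = cheapest_mix(target_score=60.0, grid=sizes)
    if result is not None:
        total_cost, n_mt, n_manual = result
        print(f"least cost {total_cost:.0f}: {n_mt:.0f} MT + {n_manual:.0f} manual examples")
```

In the framework itself, the functional form and its parameters would be fit
to observed fine-tuning results (for example on TyDIQA-GoldP) rather than
fixed by hand as they are in this sketch.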
Related papers
- Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging [1.8165993946919816]
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data, or adding support for a new language, involves retraining the model, which can be computationally inefficient and create a severe maintenance bottleneck. Recent research on merging multilingual models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied.
arXiv Detail & Related papers (2026-01-22T17:28:24Z) - Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM [2.893226191913102]
This study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. The resulting dataset provides a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus.
arXiv Detail & Related papers (2026-01-04T14:38:58Z) - Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning [2.016758225924076]
This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems. By defining a learnability score, our approach systematically evaluates the utility of data points for training. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency.
arXiv Detail & Related papers (2025-11-06T14:33:29Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) tasks or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Data-Driven Approach for Formality-Sensitive Machine Translation:
Language-Specific Handling and Synthetic Data Generation [5.536220901048185]
We introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages.
Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering.
arXiv Detail & Related papers (2023-06-26T08:45:47Z) - MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models [90.99663022952498]
SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks.
However, SUPERB incurs high computational costs due to its large datasets and diverse tasks.
We introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models, achieving results comparable to SUPERB at significantly lower computational cost.
arXiv Detail & Related papers (2023-05-30T13:07:33Z) - Leveraging Synthetic Targets for Machine Translation [5.302421715411791]
We show that training models on synthetic targets outperforms training on the actual ground-truth data.
We provide a preliminary analysis of whether this boost in performance is linked to ease of optimization or to the more deterministic nature of the predictions.
arXiv Detail & Related papers (2023-05-07T07:42:22Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on
Chinese Instruction Data for Instruction Following Large Language Model [8.21938165599387]
The choice of foundation model, the scale of the training dataset, the number of learnable parameters, and the cost of model training are all important factors.
To facilitate the reproduction of the paper's results, the dataset, model and code will be released.
arXiv Detail & Related papers (2023-04-17T09:36:36Z) - Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce
Data Annotation Required in Visual Commonsense Tasks [3.42658286826597]
We analyze different prompt-based fine-tuning techniques to improve results on both language and multimodal causal transformer models.
Our results show that with simple, model-agnostic, prompt-based fine-tuning, comparable results can be reached using only 35%-40% of the fine-tuning training dataset.
arXiv Detail & Related papers (2022-04-25T18:56:55Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization; a generic form of such an objective is sketched after this list.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual summarization aims to produce a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
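For reference on the distributionally robust multilingual MT entry above, a generic group-DRO objective over language pairs (one common formulation; the specific uncertainty set and optimization details of that paper are not given in its summary) can be written as:

```latex
\min_{\theta} \; \max_{q \in \mathcal{Q}} \; \sum_{i=1}^{N} q_i \,
  \mathbb{E}_{(x,y) \sim D_i}\!\left[\ell(x, y; \theta)\right],
\qquad
\mathcal{Q} = \left\{ q \in \Delta^{N-1} : D_f\!\left(q \,\|\, p\right) \le \rho \right\}
```

where D_i is the corpus for language pair i, p is the nominal language-pair distribution, \Delta^{N-1} the probability simplex, D_f an f-divergence, and \rho the radius of the uncertainty set. An iterated best response scheme alternates between solving the inner maximization over q for the current \theta and taking gradient steps on \theta under the resulting weights.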