Beyond Static Models and Test Sets: Benchmarking the Potential of
Pre-trained Models Across Tasks and Languages
- URL: http://arxiv.org/abs/2205.06356v1
- Date: Thu, 12 May 2022 20:42:48 GMT
- Title: Beyond Static Models and Test Sets: Benchmarking the Potential of
Pre-trained Models Across Tasks and Languages
- Authors: Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
- Abstract summary: We argue that this makes the existing practices in multilingual evaluation unreliable and fails to provide a full picture of the performance of MMLMs across the linguistic landscape.
We propose that recent work on Performance Prediction for NLP tasks can serve as a potential solution for fixing benchmarking in Multilingual NLP.
We compare performance prediction with test-data translation in a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of performance that are often on par with the translation-based approaches.
- Score: 15.373725507698591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although recent Massively Multilingual Language Models (MMLMs) like mBERT and
XLMR support around 100 languages, most existing multilingual NLP benchmarks
provide evaluation data in only a handful of these languages with little
linguistic diversity. We argue that this makes the existing practices in
multilingual evaluation unreliable and fails to provide a full picture of the
performance of MMLMs across the linguistic landscape. We propose that recent
work on Performance Prediction for NLP tasks can serve as a potential solution
for fixing benchmarking in Multilingual NLP, by using features related to data
and language typology to estimate the performance of an MMLM on different
languages. In a case study on four different multilingual datasets, we compare
performance prediction with translating test data and observe that these
methods can provide reliable estimates of performance that are often on par
with the translation-based approaches, without incurring any additional
translation or evaluation costs.
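As a rough illustration of the performance-prediction idea described in the abstract, the sketch below fits a regressor on languages that already have test sets, using data- and typology-derived features, and then estimates performance for a language without a test set. The feature names, feature values, language codes, accuracies, and the choice of regressor are all illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of performance prediction for multilingual evaluation
# (assumed setup, not the authors' exact method): fit a regressor on
# languages with test sets, then estimate performance elsewhere.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-language features: [pretraining tokens (log10),
# syntactic distance from English, task train size (log10)].
# Real systems would derive such features from corpus statistics and
# typological databases (e.g., URIEL/lang2vec).
features = {
    "de": [9.8, 0.25, 4.2],
    "hi": [8.1, 0.62, 3.5],
    "sw": [6.9, 0.70, 2.8],
    "fi": [8.7, 0.55, 3.9],
    "ta": [7.4, 0.68, 3.1],  # no test set: performance to be estimated
}
# Observed task accuracies for the languages that do have test sets
# (invented numbers for illustration only).
observed = {"de": 0.78, "hi": 0.64, "sw": 0.52, "fi": 0.71}

X_train = np.array([features[lang] for lang in observed])
y_train = np.array([observed[lang] for lang in observed])

predictor = GradientBoostingRegressor(n_estimators=200, max_depth=2)
predictor.fit(X_train, y_train)

estimate = predictor.predict(np.array([features["ta"]]))[0]
print(f"Estimated accuracy for 'ta' (no test set): {estimate:.2f}")
```

In this framing, the estimate replaces the translation-based alternative of machine-translating an existing test set into the target language and evaluating on it, which is the comparison the case study above draws.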
Related papers
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB).
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations? [19.346078451375693]
We present an analysis of existing evaluation frameworks in NLP.
We propose several directions for more robust and reliable evaluation practices.
We show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
arXiv Detail & Related papers (2024-06-20T12:46:12Z)
- Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models.
We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
- Predicting the Performance of Multilingual NLP Models [16.250791929966685]
This paper proposes an alternative solution for evaluating a model across languages, which makes use of the model's existing performance scores on the languages for which a particular task has test sets.
We train a predictor on these performance scores and use this predictor to predict the model's performance in different evaluation settings.
Our results show that our method is effective at filling gaps in the evaluation of an existing set of languages, but might require additional improvements to generalize to unseen languages (see the leave-one-language-out sketch after this list).
arXiv Detail & Related papers (2021-10-17T17:36:53Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
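As referenced in the "Predicting the Performance of Multilingual NLP Models" entry above, one simple way to probe whether such a predictor fills evaluation gaps reliably is leave-one-language-out validation: hold out each evaluated language in turn, fit on the rest, and compare the predicted and actual scores. The sketch below is a hypothetical illustration only; the languages, feature values, accuracies, and regressor are invented assumptions, not that paper's actual experimental setup.

```python
# Hypothetical leave-one-language-out check of a performance predictor.
# All numbers are illustrative placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut

langs = ["de", "hi", "sw", "fi", "ru", "ar"]
# Per-language features: [pretraining tokens (log10),
# syntactic distance from English, task train size (log10)].
X = np.array([
    [9.8, 0.25, 4.2],
    [8.1, 0.62, 3.5],
    [6.9, 0.70, 2.8],
    [8.7, 0.55, 3.9],
    [9.5, 0.40, 4.0],
    [8.9, 0.58, 3.7],
])
y = np.array([0.78, 0.64, 0.52, 0.71, 0.75, 0.66])  # observed accuracies

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = GradientBoostingRegressor(n_estimators=200, max_depth=2)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])[0]
    actual = y[test_idx][0]
    errors.append(abs(pred - actual))
    print(f"{langs[test_idx[0]]}: predicted {pred:.2f}, actual {actual:.2f}")

print(f"Mean absolute error: {np.mean(errors):.3f}")
```

A low held-out error suggests the predictor can stand in for missing test sets among languages similar to those already evaluated; larger errors on typologically distant held-out languages would mirror the caveat about generalizing to unseen languages.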