Related papers: PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese

URL: http://arxiv.org/abs/2511.17808v2
Date: Wed, 26 Nov 2025 20:44:28 GMT
Title: PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese
Authors: Thales Sales Almeida, Ramon Pires, Hugo Abonizio, Rodrigo Nogueira, Hélio Pedrini
Abstract summary: We present the most extensive evaluation of Large Language Models (LLMs) for the Portuguese language to date. We assess more than 20 models covering a broad spectrum of training scales and computational resources. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation.
Score: 13.666117502565827
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark -- a comprehensive suite of over 40 tasks in Portuguese -- we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.

Related papers

MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages [33.450081592217074]
We introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage.
arXiv Detail & Related papers (2025-06-24T09:53:00Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
Analysis of Indic Language Capabilities in LLMs [0.3599866690398789]
This report evaluates the performance of text-in text-out Large Language Models (LLMs) to understand and generate Indic languages. Hindi is the most widely represented language in models. While model performance roughly correlates with number of speakers for the top five languages, the assessment after that varies.
arXiv Detail & Related papers (2025-01-23T18:49:33Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
From Brazilian Portuguese to European Portuguese [2.048226951354646]
Brazilian Portuguese and European Portuguese are two varieties of the same language. There is a significant disproportion in the availability of resources between the two variants. This inequity can impact the quality of translation services accessible to European Portuguese speakers.
arXiv Detail & Related papers (2024-08-14T10:58:48Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks [12.665447518524187]
This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks.
arXiv Detail & Related papers (2023-11-13T16:45:37Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.