BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- URL: http://arxiv.org/abs/2307.05410v1
- Date: Tue, 11 Jul 2023 16:25:09 GMT
- Title: BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- Authors: Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás, Rodrigo
Nogueira
- Abstract summary: We introduce BLUEX, a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects.
We establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese.
- Score: 4.9069311006119865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One common trend in recent studies of language models (LMs) is the use of
standardized tests for evaluation. However, despite being the fifth most spoken
language worldwide, few such evaluations have been conducted in Portuguese.
This is mainly due to the lack of high-quality datasets available to the
community for carrying out evaluations in Portuguese. To address this gap, we
introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset
of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP
models on a variety of subjects. Furthermore, BLUEX includes a collection of
recently administered exams that are unlikely to be included in the training
data of many popular LMs as of 2023. The dataset is also annotated to indicate
the position of images in each question, providing a valuable resource for
advancing the state-of-the-art in multimodal language understanding and
reasoning. We describe the creation and characteristics of BLUEX and establish
a benchmark through experiments with state-of-the-art LMs, demonstrating its
potential for advancing the state-of-the-art in natural language understanding
and reasoning in Portuguese. The data and relevant code can be found at
https://github.com/Portuguese-Benchmark-Datasets/BLUEX
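As a rough illustration only (not the authors' released evaluation code), the sketch below shows how BLUEX-style multiple-choice records might be loaded from a local clone of the repository and scored for accuracy; the file layout and the field names (question, alternatives, answer) are assumptions rather than the repository's documented schema.

```python
# Hypothetical sketch: load BLUEX-style multiple-choice records and compute accuracy.
# The JSON layout and field names used here are assumptions, not the official schema.
import json
from pathlib import Path

def load_questions(directory: str) -> list[dict]:
    """Read one JSON record per file from a local clone of the repository (assumed layout)."""
    return [json.loads(p.read_text(encoding="utf-8")) for p in Path(directory).glob("*.json")]

def build_prompt(record: dict) -> str:
    """Format the question and its lettered alternatives as a plain-text prompt."""
    options = "\n".join(f"{letter}) {text}"
                        for letter, text in zip("ABCDE", record["alternatives"]))
    return f"{record['question']}\n{options}\nResposta:"

def accuracy(records: list[dict], predict) -> float:
    """`predict` maps a prompt string to a single answer letter; compare with the gold label."""
    correct = sum(predict(build_prompt(r)) == r["answer"] for r in records)
    return correct / len(records)
```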
Related papers
- BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language [3.3990813930813997]
We introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch.
We evaluate a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method.
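For orientation, a toy sketch of the lexical BM25 baseline idea, using the rank_bm25 package on an invented three-document Dutch corpus; the paper's actual evaluation runs over the full translated BEIR-NL collections with standard retrieval metrics.

```python
# Toy BM25 retrieval sketch; the corpus and query are invented for illustration.
from rank_bm25 import BM25Okapi

corpus = [
    "De hoofdstad van Nederland is Amsterdam.",
    "BM25 is een klassieke lexicale rankingfunctie.",
    "Dense retrievers gebruiken neurale tekstrepresentaties.",
]
tokenized = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization
bm25 = BM25Okapi(tokenized)

query = "wat is de hoofdstad van nederland".split()
scores = bm25.get_scores(query)                      # one relevance score per document
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```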
arXiv Detail & Related papers (2024-12-11T12:15:57Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) tasks or to isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the large existing pool, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- XNLIeu: a dataset for cross-lingual NLI in Basque [14.788692648660797]
In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches.
The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step.
arXiv Detail & Related papers (2024-04-10T13:19:56Z)
- Toward Informal Language Processing: Knowledge of Slang in Large Language Models [16.42982896928428]
We construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang.
For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications.
We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance.
arXiv Detail & Related papers (2024-04-02T21:50:18Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task [1.158680734110387]
This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
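As a loose illustration of zero-shot, prompt-based classification in Portuguese; the template, labels, and task below are assumptions, not the prompts used in the paper.

```python
# Illustrative zero-shot sentiment-classification prompt; template and labels are assumptions.
LABELS = ("positivo", "negativo")

def zero_shot_prompt(text: str) -> str:
    return (
        "Classifique o sentimento da frase a seguir como 'positivo' ou 'negativo'.\n"
        f"Frase: {text}\n"
        "Sentimento:"
    )

def classify(generate, text: str) -> str:
    """`generate` is any text-completion callable (e.g., a call into the evaluated LLM)."""
    completion = generate(zero_shot_prompt(text)).strip().lower()
    return next((label for label in LABELS if label in completion), "indefinido")
```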
arXiv Detail & Related papers (2024-01-05T17:15:01Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
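The underlying quantity is ordinary perplexity, the exponential of the mean per-token negative log-likelihood; a minimal sketch, using GPT-2 from Hugging Face transformers purely as a stand-in model rather than Paloma's own protocol and domains, might look like this:

```python
# Minimal perplexity sketch: exp of the mean token-level negative log-likelihood.
# GPT-2 is only a stand-in; Paloma's domains and evaluation protocol differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(texts: list[str]) -> float:
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss   # mean NLL over predicted tokens
        n_predicted = ids.size(1) - 1
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)
```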
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers [0.0]
This study includes from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models.
It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
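A tiny bag-of-words baseline in that spirit, with TF-IDF features plus logistic regression via scikit-learn; the reviews below are invented, and the study's transformer-based embeddings are not shown.

```python
# TF-IDF bag-of-words baseline for binary sentiment classification; toy, invented reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Produto excelente, chegou rápido.",
    "Péssima experiência, não recomendo.",
    "Adorei a qualidade do material.",
    "Veio quebrado e o suporte não respondeu.",
]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(reviews, labels)
print(classifier.predict(["Entrega atrasada e produto com defeito."]))
```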
arXiv Detail & Related papers (2022-12-01T15:24:19Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)