BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- URL: http://arxiv.org/abs/2307.05410v1
- Date: Tue, 11 Jul 2023 16:25:09 GMT
- Title: BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- Authors: Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás, Rodrigo
Nogueira
- Abstract summary: We introduce BLUEX, a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects.
We establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese.
- Score: 4.9069311006119865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One common trend in recent studies of language models (LMs) is the use of
standardized tests for evaluation. However, despite being the fifth most spoken
language worldwide, few such evaluations have been conducted in Portuguese.
This is mainly due to the lack of high-quality datasets available to the
community for carrying out evaluations in Portuguese. To address this gap, we
introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset
of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP
models on a variety of subjects. Furthermore, BLUEX includes a collection of
recently administered exams that are unlikely to be included in the training
data of many popular LMs as of 2023. The dataset is also annotated to indicate
the position of images in each question, providing a valuable resource for
advancing the state-of-the-art in multimodal language understanding and
reasoning. We describe the creation and characteristics of BLUEX and establish
a benchmark through experiments with state-of-the-art LMs, demonstrating its
potential for advancing the state-of-the-art in natural language understanding
and reasoning in Portuguese. The data and relevant code can be found at
https://github.com/Portuguese-Benchmark-Datasets/BLUEX
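As a rough illustration only (not the authors' released evaluation code), the sketch below shows how BLUEX-style multiple-choice records might be loaded from a local clone of the repository and scored for accuracy; the file layout and the field names (question, alternatives, answer) are assumptions rather than the repository's documented schema.

```python
# Hypothetical sketch: load BLUEX-style multiple-choice records and compute accuracy.
# The JSON layout and field names used here are assumptions, not the official schema.
import json
from pathlib import Path

def load_questions(directory: str) -> list[dict]:
    """Read one JSON record per file from a local clone of the repository (assumed layout)."""
    return [json.loads(p.read_text(encoding="utf-8")) for p in Path(directory).glob("*.json")]

def build_prompt(record: dict) -> str:
    """Format the question and its lettered alternatives as a plain-text prompt."""
    options = "\n".join(f"{letter}) {text}"
                        for letter, text in zip("ABCDE", record["alternatives"]))
    return f"{record['question']}\n{options}\nResposta:"

def accuracy(records: list[dict], predict) -> float:
    """`predict` maps a prompt string to a single answer letter; compare with the gold label."""
    correct = sum(predict(build_prompt(r)) == r["answer"] for r in records)
    return correct / len(records)
```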
Related papers
- BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language [3.3990813930813997]
We introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch.
We evaluate a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method.
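For orientation, a toy sketch of the lexical BM25 baseline idea, using the rank_bm25 package on an invented three-document Dutch corpus; the paper's actual evaluation runs over the full translated BEIR-NL collections with standard retrieval metrics.

```python
# Toy BM25 retrieval sketch; the corpus and query are invented for illustration.
from rank_bm25 import BM25Okapi

corpus = [
    "De hoofdstad van Nederland is Amsterdam.",
    "BM25 is een klassieke lexicale rankingfunctie.",
    "Dense retrievers gebruiken neurale tekstrepresentaties.",
]
tokenized = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization
bm25 = BM25Okapi(tokenized)

query = "wat is de hoofdstad van nederland".split()
scores = bm25.get_scores(query)                      # one relevance score per document
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```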
arXiv Detail & Related papers (2024-12-11T12:15:57Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) tasks or to isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the large existing pool, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- XNLIeu: a dataset for cross-lingual NLI in Basque [14.788692648660797]
In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches.
The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step.
arXiv Detail & Related papers (2024-04-10T13:19:56Z)
- Toward Informal Language Processing: Knowledge of Slang in Large Language Models [16.42982896928428]
We construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang.
For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications.
We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance.
arXiv Detail & Related papers (2024-04-02T21:50:18Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task [1.158680734110387]
This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
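As a loose illustration of zero-shot, prompt-based classification in Portuguese; the template, labels, and task below are assumptions, not the prompts used in the paper.

```python
# Illustrative zero-shot sentiment-classification prompt; template and labels are assumptions.
LABELS = ("positivo", "negativo")

def zero_shot_prompt(text: str) -> str:
    return (
        "Classifique o sentimento da frase a seguir como 'positivo' ou 'negativo'.\n"
        f"Frase: {text}\n"
        "Sentimento:"
    )

def classify(generate, text: str) -> str:
    """`generate` is any text-completion callable (e.g., a call into the evaluated LLM)."""
    completion = generate(zero_shot_prompt(text)).strip().lower()
    return next((label for label in LABELS if label in completion), "indefinido")
```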
arXiv Detail & Related papers (2024-01-05T17:15:01Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
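The underlying quantity is ordinary perplexity, the exponential of the mean per-token negative log-likelihood; a minimal sketch, using GPT-2 from Hugging Face transformers purely as a stand-in model rather than Paloma's own protocol and domains, might look like this:

```python
# Minimal perplexity sketch: exp of the mean token-level negative log-likelihood.
# GPT-2 is only a stand-in; Paloma's domains and evaluation protocol differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(texts: list[str]) -> float:
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss   # mean NLL over predicted tokens
        n_predicted = ids.size(1) - 1
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)
```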
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers [0.0]
This study includes from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models.
It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
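A tiny bag-of-words baseline in that spirit, with TF-IDF features plus logistic regression via scikit-learn; the reviews below are invented, and the study's transformer-based embeddings are not shown.

```python
# TF-IDF bag-of-words baseline for binary sentiment classification; toy, invented reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Produto excelente, chegou rápido.",
    "Péssima experiência, não recomendo.",
    "Adorei a qualidade do material.",
    "Veio quebrado e o suporte não respondeu.",
]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(reviews, labels)
print(classifier.predict(["Entrega atrasada e produto com defeito."]))
```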
arXiv Detail & Related papers (2022-12-01T15:24:19Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)