BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- URL: http://arxiv.org/abs/2307.05410v1
- Date: Tue, 11 Jul 2023 16:25:09 GMT
- Title: BLUEX: A benchmark based on Brazilian Leading Universities Entrance
eXams
- Authors: Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás, Rodrigo Nogueira
- Abstract summary: We introduce BLUEX, a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects.
We establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese.
- Score: 4.9069311006119865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One common trend in recent studies of language models (LMs) is the use of
standardized tests for evaluation. However, despite being the fifth most spoken
language worldwide, few such evaluations have been conducted in Portuguese.
This is mainly due to the lack of high-quality datasets available to the
community for carrying out evaluations in Portuguese. To address this gap, we
introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset
of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP
models on a variety of subjects. Furthermore, BLUEX includes a collection of
recently administered exams that are unlikely to be included in the training
data of many popular LMs as of 2023. The dataset is also annotated to indicate
the position of images in each question, providing a valuable resource for
advancing the state-of-the-art in multimodal language understanding and
reasoning. We describe the creation and characteristics of BLUEX and establish
a benchmark through experiments with state-of-the-art LMs, demonstrating its
potential for advancing the state-of-the-art in natural language understanding
and reasoning in Portuguese. The data and relevant code can be found at
https://github.com/Portuguese-Benchmark-Datasets/BLUEX
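To make the intended use concrete, below is a minimal sketch of a zero-shot multiple-choice evaluation loop over BLUEX-style records. The Hugging Face dataset path, the split name, and the field names (question, alternatives, answer) are assumptions for illustration only; the authoritative schema and loading code are in the repository linked above.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation over BLUEX-style data.
# The dataset path, split name, and field names below are assumptions for
# illustration; consult https://github.com/Portuguese-Benchmark-Datasets/BLUEX
# for the actual schema and loaders.
from datasets import load_dataset


def build_prompt(example):
    """Format one entrance-exam question as a multiple-choice prompt in Portuguese."""
    options = "\n".join(
        f"{letter}) {text}"
        for letter, text in zip("ABCDE", example["alternatives"])  # assumed field
    )
    return (
        "Responda à questão de vestibular escolhendo a alternativa correta.\n\n"
        f"{example['question']}\n{options}\nResposta:"  # assumed field
    )


def evaluate(answer_fn, split="questions"):
    """Compute accuracy of `answer_fn`, a callable mapping a prompt to a letter (A-E)."""
    data = load_dataset("portuguese-benchmark-datasets/BLUEX", split=split)  # assumed path
    correct = 0
    for example in data:
        prediction = answer_fn(build_prompt(example)).strip().upper()[:1]
        correct += prediction == example["answer"]  # assumed field: gold letter
    return correct / len(data)
```

In practice, `answer_fn` would wrap whichever LM is being benchmarked, and questions containing images would either be filtered out for text-only models or passed, together with the annotated image positions, to a multimodal model.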
Related papers
- XNLIeu: a dataset for cross-lingual NLI in Basque [14.788692648660797]
In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches.
The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step.
arXiv Detail & Related papers (2024-04-10T13:19:56Z)
- Toward Informal Language Processing: Knowledge of Slang in Large Language Models [16.42982896928428]
We construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang.
For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications.
We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance.
arXiv Detail & Related papers (2024-04-02T21:50:18Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task [1.158680734110387]
This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
arXiv Detail & Related papers (2024-01-05T17:15:01Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers [0.0]
This study covers NLP models ranging from classical (bag-of-words) to state-of-the-art (Transformer-based) approaches.
It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
arXiv Detail & Related papers (2022-12-01T15:24:19Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- LaoPLM: Pre-trained Language Models for Lao [3.2146309563776416]
Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PLMs have been widely used in most NLP applications, they are under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scarce situation of the Lao language.
We present the first transformer-based PLMs for Lao in four versions: BERT-small, BERT-base, ELECTRA-small, and ELECTRA-base, and evaluate them on two downstream tasks: part-of-speech tagging and text classification.
arXiv Detail & Related papers (2021-10-12T11:13:07Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)