Fugu-MT Paper Translation (Abstract): BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

Paper overview: BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

arxiv url: http://arxiv.org/abs/2606.22723v1
Date: Sun, 21 Jun 2026 23:45:49 GMT
Status: Translation complete
Last updated in system: 2026-06-25 05:06:01.831462
Title: BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams
Title (reference translation): BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams
Authors: João Guilherme Alves Santos, Giovana Kerche Bonás, Thiago Laitz, Thales Sales Almeida, Helio Pedrini
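Abstract summary: We introduce BLUEX v2, a benchmark derived from the second-phase entrance exams of Brazil's two leading universities. The dataset comprises 395 questions unfolding into 919 graded subquestions, with 55.7% of questions containing associated images. Results reveal a 4.92-point performance spread across models, with Mathematical Reasoning and Image Understanding emerging as the hardest capability dimensions.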
Reference score (independently computed attention score): 5.232617124162657
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets through multiple-choice questions from Brazilian university entrance exams, it did not cover the more challenging second-phase examinations, which require free-form written responses. In this work, we introduce BLUEX v2, a benchmark derived from the second-phase entrance exams of Brazil's two leading universities: UNICAMP (Comvest) and USP (Fuvest), spanning exam years 2022-2025. Our dataset comprises 395 questions unfolding into 919 graded subquestions, with 55.7% of questions containing associated images. Each question is annotated with subject area, official reference answers, LLM-generated rubric criteria, and six cognitive capability tags. We evaluate 21 state-of-the-art LLMs using an LLM-as-a-judge protocol. Results reveal a 4.92-point performance spread across models (4.18-9.10 on a 0-10 scale), with Mathematical Reasoning and Image Understanding emerging as the hardest capability dimensions. The dataset, evaluation code, and model outputs are publicly available at https://anonymous.4open.science/r/BLUEXv2.
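Abstract (reference translation): Although Large Language Models (LLMs) excel at many tasks, their evaluation in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. The original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets through multiple-choice questions from Brazilian university entrance exams, but it did not cover the more challenging second-phase exams, which require free-form written responses. In this work, we introduce BLUEX v2, a benchmark derived from the second-phase entrance exams of Brazil's two leading universities, UNICAMP (Comvest) and USP (Fuvest), spanning exam years 2022-2025. The dataset comprises 395 questions unfolding into 919 graded subquestions, with 55.7% of questions containing associated images. Each question is annotated with subject area, official reference answers, LLM-generated rubric criteria, and six cognitive capability tags. We evaluate 21 state-of-the-art LLMs using an LLM-as-a-judge protocol. Results reveal a 4.92-point performance spread across models (4.18-9.10 on a 0-10 scale), with Mathematical Reasoning and Image Understanding emerging as the hardest capability dimensions. The dataset, evaluation code, and model outputs are publicly available at https://anonymous.4open.science/r/BLUEXv2.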

Related papers