Fugu-MT Paper Translation (Abstract): MARCA: A Checklist-Based Benchmark for Multilingual Web Search

Paper overview: MARCA: A Checklist-Based Benchmark for Multilingual Web Search

arxiv url: http://arxiv.org/abs/2604.14448v1
Date: Wed, 15 Apr 2026 21:54:27 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.626466
Title: MARCA: A Checklist-Based Benchmark for Multilingual Web Search
Title (reference translation): MARCA: A Checklist-Based Benchmark for Multilingual Web Search
Authors: Thales Sales Almeida, Giovana Kerche Bonás, Ramon Pires, Celio Larcher, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Rodrigo Nogueira, Thiago Laitz
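Abstract summary: We present \textsc{MARCA}, a benchmark for evaluating large language models (LLMs) on web-based information seeking. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents.
Reference score (independently computed attention metric): 8.678622777553263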
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present \textsc{MARCA}, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. \textsc{MARCA} consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA
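Abstract (reference translation): Large language models (LLMs) are increasingly used as sources of information, but their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. Recent benchmarks evaluate web browsing and agentic tool use, yet multilingual settings, and Portuguese in particular, remain underexplored. We present \textsc{MARCA}, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. \textsc{MARCA} consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA.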
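
The abstract describes checklist-style rubrics and repeated runs reported with run-level uncertainty, but does not give the aggregation formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each answer is scored as the fraction of rubric items satisfied, each run is summarized by its mean question score, and uncertainty is the sample standard deviation across runs. The function names and the example data are hypothetical.

```python
from statistics import mean, stdev


def checklist_score(checks):
    """Fraction of rubric items satisfied by one answer.

    `checks` is a non-empty list of booleans, one per checklist item
    (assumed representation, not taken from the paper).
    """
    return sum(checks) / len(checks)


def run_level_summary(runs):
    """Summarize repeated runs of the same question set.

    `runs` is a list of runs; each run is a list of per-question
    checklist results. Returns (mean of per-run scores, sample standard
    deviation of per-run scores). With a single run the spread is 0.0.
    """
    per_run = [mean(checklist_score(q) for q in run) for run in runs]
    spread = stdev(per_run) if len(per_run) > 1 else 0.0
    return mean(per_run), spread


# Hypothetical data: 3 runs over the same 2 questions.
example_runs = [
    [[True, True, False, True], [True, False]],    # run average 0.625
    [[True, True, True, True], [True, False]],     # run average 0.75
    [[True, False, False, True], [True, False]],   # run average 0.5
]
avg, spread = run_level_summary(example_runs)
print(f"score = {avg:.3f} +/- {spread:.3f}")
```

Computing the spread over per-run averages, rather than over individual questions, keeps the uncertainty tied to run-to-run stochasticity, which is the quantity the abstract says is reported.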

Related papers