Fugu-MT Paper Translation (Abstract): SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Paper summary: SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

arxiv url: http://arxiv.org/abs/2604.25855v1
Date: Tue, 28 Apr 2026 16:57:29 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.963358
Title: SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Title (reference translation): SIEVES: Generalizing Selective Prediction through Visual Evidence Scoring
Authors: Hector G. Rodriguez, Marcus Rohrbach
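Abstract summary: Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Selective prediction aims to improve coverage, the share of inputs the system answers, while adhering to a user-defined risk level. To enable reliable generalization, reasoner models are required to produce localized visual evidence while answering, and a selector is designed that explicitly learns to estimate the quality of the localization provided by the reasoner.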
Reference score (independently computed attention score): 9.116950360800246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
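Abstract (reference translation): Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires meeting low error tolerances in real-world out-of-distribution (OOD) scenarios. Specifically, selective prediction aims to improve coverage, the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on answers that fall below a certain threshold. To enable reliable generalization, reasoner models are required to produce localized visual evidence while answering, and a selector is designed that explicitly learns to estimate the quality of the localization provided by the reasoner. SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA) compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners such as o3 and Gemini-3-Pro without access to their weights or logits, providing coverage gains beyond those attributable to accuracy alone. The authors highlight that SIEVES generalizes across all five tested OOD datasets and all reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro) without benchmark- or reasoner-specific training or adaptation.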
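The thresholding step described in the abstract (score each answer, abstain below a threshold, report coverage at a user-defined risk level) can be illustrated with a minimal sketch. The function name `coverage_at_risk` and the toy scores below are assumptions made for illustration; they are not taken from the paper, and the learned SIEVES selector that would produce the confidence scores is not reproduced here.

```python
from typing import Sequence, Tuple


def coverage_at_risk(
    confidences: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> Tuple[float, float]:
    """Return (coverage, threshold) for the largest answered set whose
    error rate stays at or below max_risk.

    Hypothetical helper: answers are kept when confidence >= threshold,
    and the system abstains on everything else.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    n = len(confidences)
    # Visit answers from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_coverage, best_threshold = 0.0, float("inf")
    errors = 0
    for k, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        # Tied scores cannot be separated by a threshold, so only
        # evaluate a cut once the whole tie group has been included.
        if k < n and confidences[order[k]] == confidences[idx]:
            continue
        if errors / k <= max_risk:
            best_coverage = k / n
            best_threshold = confidences[idx]
    return best_coverage, best_threshold


# Toy data (illustrative only): five answers with selector scores.
scores = [0.95, 0.9, 0.8, 0.6, 0.4]
is_correct = [True, True, False, True, False]
print(coverage_at_risk(scores, is_correct, max_risk=0.25))
```

In this toy case, answering the four most confident inputs gives one error in four answers, which meets the 25% risk level, while answering all five does not. A better selector ranks correct answers higher, so more inputs can be answered before the risk limit is hit, which is the sense in which coverage improves at a fixed risk.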

Related papers