Fugu-MT Paper Translation (Abstract): FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Paper overview: FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

arxiv url: http://arxiv.org/abs/2606.04282v1
Date: Tue, 02 Jun 2026 23:14:46 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.424025
Title: FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs
Authors: Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne
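Abstract summary: This paper introduces the first comprehensive benchmark designed to assess the promptable localization abilities of generalist MLLMs. The benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. The authors evaluate a diverse set of open-source and proprietary MLLMs and analyze their performance and limitations in depth.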
Reference score (independently computed attention score): 37.64883536754805
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.
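The abstract mentions enforcing parsable bounding box outputs and notes that models are sensitive to formatting constraints. As a rough illustration only, the sketch below parses boxes from free-form model text and scores them with intersection over union (IoU). The `[x1, y1, x2, y2]` format, the function names `parse_boxes` and `iou`, and the use of IoU are all assumptions made for this example; the abstract does not specify the benchmark's actual output format or metric.

```python
import re

# Hypothetical output format (an assumption, not taken from the paper):
# a box is written as "[x1, y1, x2, y2]" in pixel coordinates.
_NUM = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_boxes(text):
    """Return every well-formed [x1, y1, x2, y2] box found in the text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(g) for g in match.groups())
        # Skip degenerate boxes with zero or negative width or height.
        if x2 > x1 and y2 > y1:
            boxes.append((x1, y1, x2, y2))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = parse_boxes("The cat is at [10, 20, 110, 220].")
    print(predicted)
    # A minor format change (parentheses instead of brackets) yields no boxes.
    print(parse_boxes("The cat is at (10, 20, 110, 220)."))
    print(iou(predicted[0], (10.0, 20.0, 110.0, 220.0)))
```

A strict parser like this one shows why format adherence matters: a response with correct coordinates in a slightly different notation is scored as if nothing had been detected.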


Related papers