Fugu-MT Paper Translation (Abstract): Large Multimodal Models as General In-Context Classifiers

Paper summary: Large Multimodal Models as General In-Context Classifiers

arxiv url: http://arxiv.org/abs/2602.23229v1
Date: Thu, 26 Feb 2026 17:08:18 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.804039
Title: Large Multimodal Models as General In-Context Classifiers
Title (reference translation): Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
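Abstract summary: This paper argues that the usual preference for CLIP-like contrastive VLMs in classification overlooks an important capability of LMMs: in-context learning. The authors benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters. They extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task.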
Reference score (independently computed attention score): 73.11242790834383
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
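Abstract (reference translation): Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), because of their remarkable performance in zero-shot classification. Large Multimodal Models (LMMs), by contrast, are considered better suited to complex tasks. In this paper, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever they are provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing its VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and a flexible alternative to specialized models.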

Related papers