Fugu-MT Paper Translation (Abstract): CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Paper summary: CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

arxiv url: http://arxiv.org/abs/2604.22498v1
Date: Fri, 24 Apr 2026 12:26:49 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.451386
Title: CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
Authors: Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei
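Abstract summary: We propose Compositional Grounded Contrast (CGC), a low-cost full framework for boosting the fine-grained multi-image understanding of MLLMs. CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast. CGC achieves state-of-the-art results on fine-grained multi-image benchmarks such as MIG-Bench and VLM2-Bench.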
Reference score (independently computed attention score): 15.821484459549369
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
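The abstract names three targets for the Rule-Based Spatial Reward: source-image attribution, spatial alignment, and structured output validity. The following is a minimal sketch of how a reward of that kind could be scored. It is not the authors' implementation. The output format (`<think>...</think><answer>image=K, box=[x1, y1, x2, y2]</answer>`), the function names, and the weights `w_format`, `w_source`, and `w_iou` are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical structured output format (an assumption, not from the paper):
#   <think>...</think><answer>image=K, box=[x1, y1, x2, y2]</answer>
ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>\s*image\s*=\s*(\d+)\s*,\s*"
    r"box\s*=\s*\[\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\]"
    r"\s*</answer>",
    re.DOTALL,
)


def parse_answer(text: str) -> Optional[Tuple[int, Box]]:
    """Return (image index, box) if the output is well formed, else None."""
    match = ANSWER_RE.fullmatch(text.strip())
    if match is None:
        return None
    try:
        x1, y1, x2, y2 = (float(g) for g in match.groups()[1:])
    except ValueError:  # e.g. "1.2.3" passes the regex but is not a number
        return None
    return int(match.group(1)), (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(
    text: str,
    gt_image: int,
    gt_box: Box,
    w_format: float = 0.2,  # illustrative weights, not from the paper
    w_source: float = 0.3,
    w_iou: float = 0.5,
) -> float:
    """Score one sampled response with three rule-based terms."""
    parsed = parse_answer(text)
    if parsed is None:
        return 0.0  # invalid structure earns nothing
    image_idx, box = parsed
    reward = w_format  # structured output validity
    if image_idx == gt_image:
        # Source-image attribution, plus spatial alignment measured by IoU.
        reward += w_source + w_iou * iou(box, gt_box)
    return reward
```

In this sketch the IoU term is only credited when the predicted image index matches the ground truth, so a well-placed box in the wrong image earns no spatial credit. That gating is one possible design choice; the paper may combine the terms differently.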
