Fugu-MT Paper Translation (Abstract): Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Paper abstract: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2412.05939v1
Date: Sun, 08 Dec 2024 13:45:44 GMT
Status: Translation complete
System update date: 2024-12-10 23:11:44.091797
Title: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
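Abstract summary: We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. Our analyses reveal that multi-grained concept annotations integrate and complement each other under a structured template and a general MLLM framework. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks.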
Reference score (independently computed attention score): 55.25892137362187
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other under our structured template and a general MLLM framework. We explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks; for example, their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench, respectively. Code, data and models will be available at https://github.com/LooperXX/MMGiC.
