Fugu-MT Paper Translation (Summary): VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Paper summary: VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

arxiv url: http://arxiv.org/abs/2506.21556v1
Date: Wed, 11 Jun 2025 07:22:57 GMT
Status: Translation complete
Last updated in system: 2025-07-07 02:47:44.26213
Title: VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Title (reference translation): VAT-KG: A Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim
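Abstract summary: The authors propose a concept-centric, knowledge-intensive multimodal knowledge graph covering visual, audio, and text information. Its construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics. They also propose a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities.
Reference score (independently computed attention level): 3.1033038923749774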
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.
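Abstract (reference translation): Multimodal knowledge graphs (MMKGs) represent explicit knowledge across multiple modalities; they complement the implicit knowledge of Multimodal Large Language Models (MLLMs) and enable more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often built by augmenting pre-existing knowledge graphs, which restricts their knowledge and leads to outdated or incomplete coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. We therefore propose the Visual-Audio-Text Knowledge Graph (VAT-KG), a concept-centric and knowledge-intensive multimodal knowledge graph covering visual, audio, and text information. Specifically, a series of stringent filtering and alignment steps ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.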
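
Illustrative sketch (not from the paper): the abstract states that each triplet is linked to multimodal data and enriched with concept descriptions, and that retrieval returns concept-level knowledge for a query from any modality. The Python below is one minimal way to picture that, under loud assumptions: the field names (`media`, `descriptions`, `embedding`), the file paths, the toy 3-dimensional vectors, and cosine-similarity ranking are all hypothetical and stand in for whatever encoder and retrieval method VAT-KG actually uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Triplet:
    """One knowledge-graph triplet with linked media and concept descriptions.

    All field names are hypothetical; the paper's actual schema may differ.
    """
    head: str
    relation: str
    tail: str
    media: dict[str, str] = field(default_factory=dict)         # modality -> path
    descriptions: dict[str, str] = field(default_factory=dict)  # concept -> text
    embedding: tuple[float, ...] = ()                           # assumed shared space


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, triplets, k=1):
    """Return the k triplets whose embeddings are most similar to the query."""
    ranked = sorted(
        triplets,
        key=lambda t: cosine(query_embedding, t.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_context(triplets):
    """Format retrieved triplets and concept descriptions as prompt context."""
    lines = []
    for t in triplets:
        lines.append(f"({t.head}, {t.relation}, {t.tail})")
        for concept, text in t.descriptions.items():
            lines.append(f"{concept}: {text}")
    return "\n".join(lines)


# Toy graph with made-up embeddings; a real system would embed media and
# queries with a multimodal encoder (assumption).
kg = [
    Triplet("violin", "produces", "bowed string sound",
            media={"audio": "clips/violin.wav", "video": "clips/violin.mp4"},
            descriptions={"violin": "A bowed string instrument."},
            embedding=(0.9, 0.1, 0.0)),
    Triplet("dog", "makes", "bark",
            media={"audio": "clips/dog.wav"},
            descriptions={"dog": "A domesticated canine."},
            embedding=(0.0, 0.2, 0.9)),
]

# Pretend this vector is the embedding of an audio query.
top = retrieve((1.0, 0.0, 0.1), kg, k=1)
context = build_context(top)
```

The resulting `context` string would then be prepended to the MLLM prompt; how VAT-KG itself formats retrieved knowledge is not specified in the abstract.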

Related papers