Fugu-MT Paper Translation (Summary): Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering

Paper overview: Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering

arxiv url: http://arxiv.org/abs/2602.06050v1
Date: Wed, 14 Jan 2026 04:08:31 GMT
Status: Translation complete
Last updated in system: 2026-02-15 14:54:53.661537
Title: Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
Authors: Jongha Kim, Byungoh Ko, Jeehye Na, Jinsung Yoon, Hyunwoo J. Kim
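Abstract summary: Relevance-aware Multi-Context Contrastive Decoding (RMCD) is a new decoding method for RAG. RMCD produces its final prediction by combining the outputs predicted with each retrieved context, weighting each output by that context's relevance to the question. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs.
Reference score (site-computed attention score): 37.441396497173265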
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at https://github.com/mlvlab/RMCD.
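The abstract describes RMCD only at a high level: predict with each context, then combine those predictions weighted by relevance to the question. The sketch below is a minimal, hypothetical illustration of one decoding step in that spirit. It is not the formula from the paper or from https://github.com/mlvlab/RMCD.

It rests on several assumptions:

- The LVLM has already produced next-token logits for the question alone.
- The LVLM has also produced next-token logits for the question paired with each retrieved context.
- Each context comes with a relevance score in [0, 1].
- The weighting rule is relevance minus a hypothetical `threshold`. Contexts below the threshold therefore receive negative weight and are contrasted against rather than trusted.

All function and parameter names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def relevance_weights(relevance: Sequence[float], threshold: float) -> List[float]:
    """Hypothetical weighting rule (an assumption, not the paper's):
    contexts scoring above `threshold` get a positive weight, contexts
    below it get a negative weight, so their influence is subtracted."""
    return [r - threshold for r in relevance]


def combine_contexts(
    per_context_logits: Sequence[Sequence[float]],
    relevance: Sequence[float],
    no_context_logits: Sequence[float],
    threshold: float = 0.5,
) -> List[float]:
    """Relevance-weighted multi-context contrastive scores for one step.

    score[v] = log p(v | q) + sum_i w_i * (log p(v | q, c_i) - log p(v | q))
    """
    if len(per_context_logits) != len(relevance):
        raise ValueError("need exactly one relevance score per context")
    base = log_softmax(no_context_logits)
    scores = list(base)
    for w, logits in zip(relevance_weights(relevance, threshold), per_context_logits):
        lp = log_softmax(logits)
        # Each context shifts the scores by how much it changes the
        # question-only prediction, scaled by its (signed) weight.
        for v in range(len(scores)):
            scores[v] += w * (lp[v] - base[v])
    return scores


def greedy_next_token(scores: Sequence[float]) -> int:
    """Pick the highest-scoring token id (greedy decoding)."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy 3-token vocabulary: context 0 is relevant and favours token 0,
    # context 1 is irrelevant and favours token 1.
    s = combine_contexts([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]], [0.9, 0.1], [0.0, 0.0, 0.0])
    print(greedy_next_token(s))
```

In the toy run, the irrelevant context's preferred token ends up with the lowest score, because its negative weight pushes the distribution away from it. Because the combination only touches next-token scores, a step like this could replace an LVLM's usual per-step token selection without retraining the model. That is the kind of drop-in use the abstract describes, though the paper's actual weighting may differ.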


List of related papers