Fugu-MT Paper Translation (Summary): GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering

Paper summary: GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering

arxiv url: http://arxiv.org/abs/2402.02503v1
Date: Sun, 4 Feb 2024 14:28:23 GMT
Status: translation complete
Last updated in system: 2024-02-06 19:11:14.248362
Title: GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Title (reference translation): GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Authors: Ziyu Ma, Shutao Li, Bin Sun, Jianfei Cai, Zuxiang Long, and Fuyan Ma
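Abstract summary: We argue that a multimodal large language model (MLLM) is a better implicit knowledge engine than a large language model (LLM) because of its superior visual understanding. We propose GeReA, a generate-reason framework that provides an MLLM such as InstructBLIP with question-relevant vision and language information to generate knowledge-relevant descriptions. Specifically, question-relevant image regions and question-specific manual prompts are encoded in the MLLM to generate the knowledge-relevant descriptions.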
Reference score (independently computed attention score): 37.11794716736831
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge-based visual question answering (VQA) requires world knowledge beyond the image for accurate answers. Recently, instead of extra knowledge bases, a large language model (LLM) like GPT-3 is activated as an implicit knowledge engine to jointly acquire and reason over the necessary knowledge for answering by converting images into textual information (e.g., captions and answer candidates). However, such conversion may introduce irrelevant information, which causes the LLM to misinterpret images and ignore visual details crucial for accurate knowledge. We argue that a multimodal large language model (MLLM) is a better implicit knowledge engine than the LLM because of its superior capability of visual understanding. Despite this, how to activate the capacity of an MLLM as the implicit knowledge engine has not been explored yet. Therefore, we propose GeReA, a generate-reason framework that prompts an MLLM like InstructBLIP with question-relevant vision and language information to generate knowledge-relevant descriptions and reasons over those descriptions for knowledge-based VQA. Specifically, the question-relevant image regions and question-specific manual prompts are encoded in the MLLM to generate the knowledge-relevant descriptions, referred to as question-aware prompt captions. After that, the question-aware prompt captions, image-question pair, and similar samples are sent into the multi-modal reasoning model to learn a joint knowledge-image-question representation for answer prediction. GeReA unlocks the use of MLLM as the implicit knowledge engine, surpassing all previous state-of-the-art methods on OK-VQA and A-OKVQA datasets, with test accuracies of 66.5% and 63.3% respectively. Our code will be released at https://github.com/Upper9527/GeReA.
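Abstract (reference translation): Knowledge-based visual question answering (VQA) requires world knowledge beyond the image to answer accurately. Recently, instead of extra knowledge bases, large language models (LLMs) such as GPT-3 have been activated as implicit knowledge engines, converting images into textual information (e.g., captions and answer candidates) to jointly acquire and reason over the knowledge needed for answering. However, such conversion may introduce irrelevant information, so that the LLM misinterprets images and ignores visual details that are crucial for accurate knowledge. We argue that a multimodal large language model (MLLM) is a better implicit knowledge engine than an LLM because of its superior visual understanding. Even so, how to activate the capacity of an MLLM as an implicit knowledge engine has not yet been explored. We therefore propose GeReA, a generate-reason framework that provides an MLLM such as InstructBLIP with question-relevant vision and language information to generate knowledge-relevant descriptions. Specifically, question-relevant image regions and question-specific manual prompts are encoded in the MLLM to generate knowledge-relevant descriptions, referred to as question-aware prompt captions. The question-aware prompt captions, the image-question pair, and similar samples are then sent to a multimodal reasoning model, which learns a joint knowledge-image-question representation for answer prediction. GeReA uses the MLLM as an implicit knowledge engine and surpasses all previous state-of-the-art methods on the OK-VQA and A-OKVQA datasets, with test accuracies of 66.5% and 63.3% respectively. The code will be released at https://github.com/Upper9527/GeReA.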
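The two stages described in the abstract (generate, then reason) can be pictured with a minimal Python sketch. This is not the authors' code: every name here (`VQASample`, `PROMPT_TEMPLATES`, `generate_prompt_captions`, `gerea_sketch`) is hypothetical, the prompt wording is invented for illustration, and the region selector, MLLM, retriever, and reasoning model are passed in as plain callables that stand in for real components such as InstructBLIP.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VQASample:
    image: str      # stand-in for image data (here just an identifier)
    question: str


# Hypothetical manual prompts; the abstract does not give the actual wording.
PROMPT_TEMPLATES = [
    "Question: {question} Describe the image content relevant to the question.",
    "Question: {question} What background knowledge helps answer it?",
]


def generate_prompt_captions(
    sample: VQASample,
    select_regions: Callable[[str, str], List[str]],
    mllm: Callable[[List[str], str], str],
    templates: Sequence[str] = PROMPT_TEMPLATES,
) -> List[str]:
    """Generate stage: one knowledge-relevant description per manual prompt."""
    # Keep only the image regions relevant to the question (assumed helper).
    regions = select_regions(sample.image, sample.question)
    # Query the MLLM once per question-specific prompt.
    return [mllm(regions, t.format(question=sample.question)) for t in templates]


def gerea_sketch(
    sample: VQASample,
    select_regions: Callable[[str, str], List[str]],
    mllm: Callable[[List[str], str], str],
    retrieve_similar: Callable[[VQASample], List[VQASample]],
    reasoner: Callable[[VQASample, List[str], List[VQASample]], str],
) -> str:
    """Generate captions, then let the reasoning model predict the answer."""
    captions = generate_prompt_captions(sample, select_regions, mllm)
    similar = retrieve_similar(sample)
    # Reason stage: captions, the image-question pair, and similar samples
    # are handed to the reasoning model together.
    return reasoner(sample, captions, similar)
```

Passing the components as callables keeps the sketch runnable with simple stubs; how GeReA actually selects regions, retrieves similar samples, or trains its reasoning model is not specified in the abstract.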

Related papers

Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models [10.526705722339775]
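Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and an external knowledge base using the original complex question, and then generate answers with Large Language Models (LLMs). We propose DKA: Disentangled Knowledge Acquisition from LLM feedback.
Paper reference translation (metadata) (2024-07-22T03:05:32Z)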
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
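Multimodal large language models (MLLMs) have made significant progress by training on vast, high-quality image-text datasets. However, the difficulty of explicitly conveying fine-grained or spatially dense information, such as masks, in text remains a challenge for MLLMs. This paper proposes a new visual prompting approach that integrates fine-grained external knowledge, derived from specialized vision models, into MLLMs.
Paper reference translation (metadata) (2024-07-05T17:43:30Z)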
Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models [51.72963030032491]
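Knowledge documents provided to large language models (LLMs) may conflict with the LLMs' memory because of outdated or incorrect knowledge. We construct KNOT, a new dataset for knowledge conflict resolution.
Paper reference translation (metadata) (2024-04-04T16:40:11Z)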
Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
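Large language models (LLMs) demonstrate impressive reasoning ability and retention of world knowledge. Because images are invisible to LLMs, researchers convert images to text to bring LLMs into the visual question reasoning procedure. We design a framework that enables LLMs to proactively ask relevant questions and so reveal more detailed information about the image.
Paper reference translation (metadata) (2023-11-20T08:23:39Z)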
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [7.888547093390469]
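Large language models (LLMs) can perform zero-shot closed-book question answering tasks. We propose to augment the knowledge directly in the input of LLMs. Our framework, KAPING (Knowledge-Augmented Language Model Prompting), requires no model training and is therefore completely zero-shot.
Paper reference translation (metadata) (2023-06-07T04:15:21Z)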
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
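Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Recent work uses a powerful large language model (LLM) as an implicit knowledge engine to acquire the knowledge needed for answering. This paper presents a conceptually simple, flexible, and general framework for prompting an LLM to answer knowledge-based VQA questions.
Paper reference translation (metadata) (2023-03-03T13:05:15Z)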
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge [48.457788853408616]
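This paper proposes a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues. We show that VLC-BERT outperforms existing models that use static knowledge bases.
Paper reference translation (metadata) (2022-10-24T22:01:17Z)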
GreaseLM: Graph REASoning Enhanced Language Models for Question Answering [159.9645181522436]
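GreaseLM is a new model that fuses encoded representations from a pretrained LM and a graph neural network over multiple layers of modality interaction operations. GreaseLM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge.
Paper reference translation (metadata) (2022-01-21T19:00:05Z)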
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
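One of the most challenging question types in VQA arises when answering requires outside knowledge that is not present in the image. In this work we study open-domain knowledge, the setting in which the knowledge needed to answer a question is neither given nor annotated, at training time or at test time. We draw on two types of knowledge representation and reasoning. The first is implicit knowledge, which transformer-based models can learn effectively from unsupervised language pre-training and supervised training data.
Paper reference translation (metadata) (2020-12-20T20:13:02Z)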

The related papers list is generated automatically from the titles and abstracts of papers on this site.

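This page shows information for the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any consequences arising from use of the site (including all information and translations).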