Fugu-MT Paper Translation (Abstract): A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

Paper summary: A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

arxiv url: http://arxiv.org/abs/2411.11150v1
Date: Sun, 17 Nov 2024 18:52:06 GMT
Status: Translation complete
Last updated in system: 2024-11-28 17:07:48.529285
Title: A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
Authors: Raihan Kabir, Naznin Haque, Md Saiful Islam, Marium-E-Jannat
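Abstract summary: We meticulously analyze the current state of VQA datasets and models, cleanly dividing them into distinct categories and summarizing the methodologies and characteristics of each category. We explore six main paradigms of VQA models: fusion; attention, the technique of using information from one modality to filter information from another; external knowledge base; composition or reasoning; explanation; and graph models.
Reference score (independently calculated attention score): 1.941892373913038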
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding of the image and the semantic understanding of the question, demonstrating reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding the scene text, and (4) KB (Knowledge-Based) datasets that are designed to measure a model's ability to utilize outside knowledge. Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing information between visual and textual modalities; attention, the technique of using information from one modality to filter information from another; external knowledge base, where we discuss different models utilizing outside information; composition or reasoning, where we analyze techniques to answer advanced questions that require complex reasoning steps; explanation, which is the process of generating visual and textual descriptions to verify sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss some miscellaneous topics, such as scene text understanding, counting, and bias reduction.

Related papers

Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison [0.0]
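Visual Question Answering (VQA) has emerged as an important task at the intersection of computer vision and natural language processing. This paper presents a comparative study of conventional VQA datasets, baseline models and methods, and five advanced VQA models.
Paper reference translation (metadata) (2025-02-20T18:45:00Z)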
Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
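This paper proposes a new learning paradigm for visual question generation with answer-awareness and region-reference. We develop a simple method to self-learn the visual hints without introducing additional human annotations.
Paper reference translation (metadata) (2024-07-06T15:07:32Z)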
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
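We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese. We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. PhoVIT is a comprehensive multimodal fusion model that identifies objects in images based on questions.
Paper reference translation (metadata) (2023-10-27T10:44:50Z)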
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
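This paper proposes a comprehensive dataset called UNK-VQA. First, we augment existing data by deliberately perturbing either the image or the question. We then extensively evaluate the zero-shot and few-shot performance of emerging multimodal large models.
Paper reference translation (metadata) (2023-10-17T02:38:09Z)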
Making the V in Text-VQA Matter [1.2962828085662563]
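Text-based VQA aims to answer questions by reading the text present in images. Recent studies have shown that the question-answer pairs in the dataset focus more on the text present in the image. Models trained on this dataset predict biased answers due to a lack of understanding of visual context.
Paper reference translation (metadata) (2023-08-01T05:28:13Z)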
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
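Document Visual Question Answering (Document VQA) aims to understand visually rich documents in order to answer questions posed in natural language. We introduce TAT-DQA, a new Document VQA dataset consisting of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account multi-modal information, including text, layout, and visual image, to intelligently address different types of questions.
Paper reference translation (metadata) (2022-07-25T01:43:19Z)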
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
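A-OKVQA is a crowdsourced dataset composed of about 25K questions. We demonstrate the potential of this new dataset through a detailed analysis of its contents.
Paper reference translation (metadata) (2022-06-03T17:52:27Z)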
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
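Learning to answer visual questions is a challenging task because the multimodal inputs lie in two different feature spaces. We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA). Our model splits alignment into different levels to learn better correlations without requiring additional data or annotations.
Paper reference translation (metadata) (2022-01-25T22:30:54Z)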
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
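We propose a new dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation. We generate question-answer pairs using controlled programs, based on both the Visual Genome scene graph and an external knowledge base.
Paper reference translation (metadata) (2020-12-14T00:33:44Z)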
Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
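Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This paper describes an image with multiple knowledge graphs from the visual, semantic, and factual views. We decompose the model into a series of memory-based reasoning steps, each performed by Graph-based Read, Update, and Control. We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB, and OK-VQA.
Paper reference translation (metadata) (2020-08-31T23:25:01Z)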
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
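This paper proposes a video question answering model that effectively integrates multimodal input sources and finds the temporally relevant information to answer questions. The model also comprises dual-level attention (word/object and frame level), multi-head self-integration across different sources (video and dense captions), and gates that pass on more relevant information. We evaluate our model on the challenging TVQA dataset, where each model component provides significant gains and the overall model outperforms the state-of-the-art by a large margin.
Paper reference translation (metadata) (2020-05-13T16:35:27Z)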

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

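This is information about the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).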