Fugu-MT Paper Translation (Abstract): Self-Segregating and Coordinated-Segregating Transformer for Focused Deep Multi-Modular Network for Visual Question Answering

Paper abstract: Self-Segregating and Coordinated-Segregating Transformer for Focused Deep Multi-Modular Network for Visual Question Answering

arxiv url: http://arxiv.org/abs/2006.14264v1
Date: Thu, 25 Jun 2020 09:17:03 GMT
Status: Translation complete
Last updated in system: 2022-11-17 02:38:26.712492
Title: Self-Segregating and Coordinated-Segregating Transformer for Focused Deep Multi-Modular Network for Visual Question Answering
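Title (reference translation): Self-segregating and coordinated-segregating transformer for a focused deep multi-modular network for visual question answering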
Authors: Chiranjib Sur
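Abstract summary: We define segregating strategies that can prioritize content for applications in order to improve performance. We defined two strategies: Self-Segregating Transformer (SST) and Coordinated-Segregating Transformer (CST). This work can easily be used in many other applications that involve repetition and multiple frames of features.
Reference score (attention level computed by this site): 9.89901717499058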
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attention mechanism has gained huge popularity due to its effectiveness in achieving high accuracy in different domains. But attention is opportunistic and is not justified by the content or usability of the content. Transformer like structure creates all/any possible attention(s). We define segregating strategies that can prioritize the contents for the applications for enhancement of performance. We defined two strategies: Self-Segregating Transformer (SST) and Coordinated-Segregating Transformer (CST) and used it to solve visual question answering application. Self-segregation strategy for attention contributes in better understanding and filtering the information that can be most helpful for answering the question and create diversity of visual-reasoning for attention. This work can easily be used in many other applications that involve repetition and multiple frames of features and would reduce the commonality of the attentions to a great extent. Visual Question Answering (VQA) requires understanding and coordination of both images and textual interpretations. Experiments demonstrate that segregation strategies for cascaded multi-head transformer attention outperforms many previous works and achieved considerable improvement for VQA-v2 dataset benchmark.
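Abstract (reference translation): Attention mechanisms have gained great popularity because of their effectiveness in achieving high accuracy across different domains. However, attention is opportunistic and is not justified by the content or by the usability of the content. Transformer-like structures create any and all possible attention. We define segregating strategies that can prioritize content for applications in order to improve performance. We defined two strategies: Self-Segregating Transformer (SST) and Coordinated-Segregating Transformer (CST). The self-segregation strategy for attention contributes to better understanding and filtering of the information that is most helpful for answering the question, and creates diversity of visual reasoning for attention. This work can easily be used in many other applications that involve repetition and multiple frames of features, and it can greatly reduce the commonality of the attentions. Visual Question Answering (VQA) requires understanding and coordination of both images and textual interpretations. Experiments show that segregation strategies for cascaded multi-head transformer attention outperform many previous works and achieve considerable improvement on the VQA-v2 dataset benchmark.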

Related papers

PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification [73.64560354556498]
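Vision Transformer (ViT) tends to overfit to the most distinctive regions of the training data, which limits its generalizability and its attention to holistic object features. This paper introduces PartFormer, an innovative adaptation of ViT designed to overcome these limitations in object Re-ID tasks. Our framework significantly outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.
Paper reference translation (metadata) (2024-08-29T16:31:05Z)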
Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
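Exploiting fine-grained correspondence and visual-semantic alignment has shown great potential in image-text matching. We develop two simple but highly effective regulators that efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Experiments on the MSCOCO and Flickr30K datasets demonstrate that the regulators bring impressive and consistent R@1 gains across multiple models.
Paper reference translation (metadata) (2023-03-23T15:42:05Z)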
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
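Learning to answer visual questions is a challenging task because the multi-modal inputs lie in two feature spaces. We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA). Our model splits alignment into different levels and learns better correlations without requiring additional data or annotations.
Paper reference translation (metadata) (2022-01-25T22:30:54Z)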
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering [5.547800834335381]
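In this work, we investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question. We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers. Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models.
Paper reference translation (metadata) (2022-01-11T14:25:17Z)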
Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
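Multi-head, key-value attention is the backbone of the Transformer model and its variants. Standard attention heads learn a rigid mapping between search and retrieval. In this paper, we propose Compositional Attention, a novel attention mechanism that replaces the standard head structure.
Paper reference translation (metadata) (2021-10-18T15:47:38Z)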
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
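We propose a novel attention mechanism called external attention, based on two external, small, learnable, shared memories. The proposed method achieves performance comparable to the self-attention mechanism and its variants, with much lower computational and memory costs.
Paper reference translation (metadata) (2021-05-05T22:29:52Z)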
Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
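We study the TextVQA task, that is, reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations. We propose a spatially aware self-attention layer.
Paper reference translation (metadata) (2020-07-23T17:20:55Z)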
Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
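We propose a collaborative multi-head attention layer that enables heads to learn shared projections. Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation, and vision.
Paper reference translation (metadata) (2020-06-29T20:28:52Z)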
SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning [9.89901717499058]
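As feature length increases, it becomes increasingly important to include provisions for better capturing the pertinent content. In this work, we introduce a new concept, the Self-Aware Composition Transformer (SACT), which is capable of generating Multinomial Attention (MultAtt). We propose a Self-Aware Composition Transformer model for dense video captioning and apply the technique to benchmark datasets such as ActivityNet and YouCookII.
Paper reference translation (metadata) (2020-06-25T09:11:49Z)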

The related papers list is generated automatically from the titles and abstracts of papers on this site.
