Fugu-MT 論文翻訳(概要): Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

論文の概要: Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

arxiv url: http://arxiv.org/abs/2005.07310v2
Date: Sat, 18 Jul 2020 23:10:35 GMT
ステータス: 翻訳完了
システム内更新日: 2022-12-02 22:25:22.817677
Title: Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Title（参考訳）: 舞台裏:事前訓練された視覚言語モデルの秘密を明らかにする
Authors: Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen and Jingjing Liu
Abstract要約: 最近のTransformerベースの大規模事前学習モデルは、視覚言語(V+L)研究に革命をもたらした。 VALUEは,マルチモーダル事前学習における内部動作の解明を目的とした,精密に設計された探索タスクのセットである。主要な観察:事前訓練されたモデルは、推論中の画像よりもテキストに出席する傾向を示す。
参考スコア（独自算出の注目度）: 65.19308052012858
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that destine their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) Learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually-interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architecture and objectives for multimodal pre-training.
Abstract（参考訳）: 最近のトランスフォーマーベースの大規模事前学習モデルが視覚言語研究(v+l)に革命をもたらした。 ViLBERT、LXMERT、UNITERといったモデルでは、共同画像テキストによる事前トレーニングを備えた広範囲なV+Lベンチマークにおいて、技術の現状が大幅に向上している。しかし、その印象的な成功を阻害する内部機構についてはほとんど知られていない。 To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). これらの探索タスクを通じて、各アーチティパルモデルアーキテクチャの広範な分析を通じて、我々の重要な観察は以下のとおりである。 (i)事前学習されたモデルでは、推論中の画像ではなく、テキストで参加する傾向を示す。 (ii)クロスモーダル相互作用を捉えるために調整されたアテンションヘッドのサブセットが存在する。 (iii)事前学習モデルにおける学習注意行列は、画像領域とテキスト単語間の潜在的アライメントと一致するパターンを示す。 (4)注意パターンは画像領域間で視覚的に解釈可能な関係を示す。 (v)純粋言語知識は、注意ヘッドにおいても効果的に符号化される。これらは、よりよいモデルアーキテクチャの設計とマルチモーダル事前トレーニングの目的に向けた今後の取り組みを導く上で役立つ貴重な洞察である。

関連論文リスト

Cross-Modal Consistency in Multimodal Large Language Models [33.229271701817616]
クロスモーダル一貫性という新しい概念を導入する。実験結果から, GPT-4V内における視覚と言語モダリティの矛盾が明らかとなった。我々の研究は、そのようなモデルの適切な利用に関する洞察と、その設計を強化するための潜在的な道のヒントを得る。
論文参考訳（メタデータ） (2024-11-14T08:22:42Z)
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$textEVL_textGen$は、視覚条件付き言語生成モデルの事前トレーニング用に設計されたフレームワークである。提案手法は,視覚言語モデルの学習を5倍に加速させるが,全体的な性能に顕著な影響を与えないことを示す。
論文参考訳（メタデータ） (2023-10-05T03:40:06Z)
GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
本稿では,大規模な事前学習モデルから抽出した知識を利用して,CNN や ViT などのモデルが拡張表現を学習するのを支援する新しい学習パラダイムを提案する。我々は、詳細な記述を事前訓練されたエンコーダに入力し、画像の内容をエンコードするリッチなセマンティック情報でテキスト埋め込みを抽出する。
論文参考訳（メタデータ） (2023-06-01T14:02:45Z)
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
既存の視覚・言語モデルと視覚のみのモデルにおける視覚表現の比較分析を行う。我々の経験的観察は、視覚・言語モデルがラベル予測タスクに優れていることを示唆している。我々の研究は、視覚学習における言語の役割に光を当て、様々な事前学習モデルの実証的なガイドとして機能することを願っている。
論文参考訳（メタデータ） (2022-12-01T05:00:18Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
視覚と言語(V-L)タスクは、視覚内容と自然言語の両方を理解する必要がある。視覚と言語に対する注意空間を分離したDiMBERT(Disentangled Multimodal-Attention BERT)を提案する。 DiMBERTは3つのタスクに対して最新のパフォーマンスを新たに設定する。
論文参考訳（メタデータ） (2022-10-28T23:00:40Z)
Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis [25.482853330324748]
近年,マルチモーダル・アスペクトベース感性分析 (MABSA) が注目されている。 i) クロスモーダルアライメントを無視した事前学習された視覚モデルとテキストモデル、または(ii) 一般的な事前学習タスクで事前訓練された視覚的なきめ細やかなモデルのいずれかを使用する。我々は,MABSA(MABSA)のためのタスク固有のビジョンランゲージ事前学習フレームワークを提案する。
論文参考訳（メタデータ） (2022-04-17T08:44:00Z)
A Survey of Vision-Language Pre-Trained Models [41.323956143107644]
事前訓練されたモデルは近年、ブレークネックペースで進歩している。ビジョン・アンド・ランゲージ学習の分野に事前学習を適応させ、下流タスクのパフォーマンスを向上させる方法は、マルチモーダル学習の焦点となる。
論文参考訳（メタデータ） (2022-02-18T15:15:46Z)
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
ROSITAと呼ばれる新しい視覚・言語事前学習手法を提案する。クロスモーダルとイントラモーダルの知識を統合されたシーングラフに統合し、セマンティックアライメントを強化する。 ROSITAは6つのベンチマークデータセット上での3つの典型的な視覚・言語タスクにおいて、既存の最先端メソッドを大幅に上回っている。
論文参考訳（メタデータ） (2021-08-16T13:16:58Z)
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
クロスモーダルコントラスト学習フレームワークにおいて,BriVLと呼ばれる2重塔前訓練モデルを提案する。単純なコントラスト学習手法を採用したopenaiクリップとは異なり,最新のメソッドmocoをクロスモーダルシナリオに適用することにより,より高度なアルゴリズムを考案する。大規模なキューベースの辞書を構築することで、BriVLは限られたGPUリソースにネガティブなサンプルを組み込むことができます。
論文参考訳（メタデータ） (2021-03-11T09:39:49Z)
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
画像キャプションコーパスを使わずに教師なし事前学習により、強力なV&L表現モデルを学習できるかどうかを検討する。特に,テキストのみのコーパスと画像のみのコーパスで,マスク・アンド・予測の事前学習を行うことを提案する。 4つの英語のV&Lベンチマークで、アライメントされたデータで事前訓練されたモデルに近いこのような単純なアプローチの性能が得られた。
論文参考訳（メタデータ） (2020-10-24T08:17:54Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。