DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention
- URL: http://arxiv.org/abs/2210.16431v1
- Date: Fri, 28 Oct 2022 23:00:40 GMT
- Title: DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention
- Authors: Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian
Zou
- Abstract summary: Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
- Score: 101.99313208598569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language (V-L) tasks require the system to understand both vision
content and natural language, thus learning fine-grained joint representations
of vision and language (a.k.a. V-L representations) is of paramount importance.
Recently, various pre-trained V-L models have been proposed to learn V-L
representations and achieve improved results on many tasks. However, the
mainstream models process both vision and language inputs with the same set of
attention matrices. As a result, the generated V-L representations are
entangled in one common latent space. To tackle this problem, we propose
DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel
framework that applies separate attention spaces to vision and language, so
that the representations of the two modalities can be explicitly
disentangled. To enhance the correlation between vision and language in the
disentangled spaces, we introduce visual concepts into DiMBERT, which
represent visual information in textual form. In this manner, visual
concepts help to bridge the gap between the two modalities. We pre-train
DiMBERT on a large number of image-sentence pairs on two tasks:
bidirectional language modeling and sequence-to-sequence language modeling.
After pre-training, DiMBERT is further fine-tuned for the
downstream tasks. Experiments show that DiMBERT sets new state-of-the-art
performance on three tasks (over four datasets), including both generation
tasks (image captioning and visual storytelling) and classification tasks
(referring expressions). The proposed DiM (short for Disentangled
Multimodal-Attention) module can be easily incorporated into existing
pre-trained V-L models to boost their performance, up to a 5% increase on the
representative task. Finally, we conduct a systematic analysis and demonstrate
the effectiveness of our DiM and the introduced visual concepts.
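The abstract gives the idea of the DiM module but no implementation details. Below is a minimal, hedged sketch of what modality-specific attention projections over a joint vision-language sequence could look like; the class name, parameter layout, and the single joint attention over the concatenated sequence are our own illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: names and structure are assumptions, not DiMBERT's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledMultimodalAttention(nn.Module):
    """Projects language tokens and vision tokens with separate attention
    matrices, then attends over the concatenated sequence, so the two
    modalities are not forced into one shared projection space."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        # Separate Q/K/V projections per modality (the "disentangled" part).
        self.lang_qkv = nn.Linear(hidden_dim, 3 * hidden_dim)
        self.vis_qkv = nn.Linear(hidden_dim, 3 * hidden_dim)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, lang_tokens: torch.Tensor, vis_tokens: torch.Tensor):
        # lang_tokens: (B, L_t, D) word / visual-concept embeddings
        # vis_tokens:  (B, L_v, D) image region features
        B = lang_tokens.size(0)
        q_l, k_l, v_l = self.lang_qkv(lang_tokens).chunk(3, dim=-1)
        q_v, k_v, v_v = self.vis_qkv(vis_tokens).chunk(3, dim=-1)
        # Concatenate along the sequence axis after modality-specific projection.
        q = torch.cat([q_l, q_v], dim=1)
        k = torch.cat([k_l, k_v], dim=1)
        v = torch.cat([v_l, v_v], dim=1)

        def split_heads(x):
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, -1, self.num_heads * self.head_dim)
        return self.out(ctx)
```

In this sketch, the visual concepts mentioned in the abstract would simply be appended to the language token sequence, since they are expressed in textual form.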
Related papers
- MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception [24.406224705072763]
Mutually Reinforced Multimodal Large Language Model (MR-MLLM) is a novel framework that enhances visual perception and multimodal comprehension.
First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models.
Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs.
arXiv Detail & Related papers (2024-06-22T07:10:36Z) - LION : Empowering Multimodal Large Language Model with Dual-Level Visual
Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z) - MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language
Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z) - MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic
Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.