UNIMO-3: Multi-granularity Interaction for Vision-Language
Representation Learning
- URL: http://arxiv.org/abs/2305.13697v1
- Date: Tue, 23 May 2023 05:11:34 GMT
- Title: UNIMO-3: Multi-granularity Interaction for Vision-Language
Representation Learning
- Authors: Hao Yang, Can Gao, Hao Liu, Xinyan Xiao, Yanyan Zhao, Bing Qin
- Abstract summary: We propose the UNIMO-3 model, which simultaneously learns multimodal in-layer interaction and cross-layer interaction.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's multimodal representation ability.
- Score: 35.88753097105914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language (VL) pre-training aims to learn a general
representation of image-text pairs that can be transferred to various
vision-and-language tasks. Compared with modeling uni-modal data, the main
challenge for VL models is learning cross-modal interaction from multimodal
data, especially fine-grained interaction. Existing works have shown that
fully transformer-based models, which adopt attention mechanisms to learn
in-layer cross-modal interaction, can achieve impressive performance on
various cross-modal downstream tasks. However, they ignore that the semantic
information of different modalities at the same layer is not uniform, which
causes cross-modal interaction to collapse into a limited exchange of
multimodal semantic information. In this work, we propose the UNIMO-3 model,
which can simultaneously learn multimodal in-layer interaction and
cross-layer interaction. UNIMO-3 establishes effective connections between
different layers in a cross-modal encoder and adaptively captures the
interaction between the two modalities at different levels. Experimental
results show that our model achieves state-of-the-art performance on various
downstream tasks, and an ablation study shows that effective cross-layer
learning improves the model's multimodal representation ability.
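
The abstract describes the mechanism only at a high level. As a rough illustration, the PyTorch sketch below shows one way a single-stream cross-modal encoder could combine in-layer self-attention over concatenated image and text tokens with gated cross-layer attention over the outputs of earlier layers. The class names, gating scheme, and hyperparameters are assumptions made for illustration, not UNIMO-3's published architecture.

```python
# Minimal sketch (not UNIMO-3 itself): each layer performs in-layer interaction
# via self-attention over the fused image+text sequence, then adaptively mixes in
# cross-layer interaction by attending over the outputs of all earlier layers.
import torch
import torch.nn as nn


class CrossLayerFusionLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        # In-layer interaction: standard self-attention over the concatenated
        # image + text token sequence.
        self.in_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Cross-layer interaction: attend from the current hidden states to the
        # stacked earlier-layer outputs, then gate the result back in.
        self.cross_layer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        hidden = self.in_layer(hidden)                          # (B, L, D)
        if history:
            memory = torch.cat(history, dim=1)                  # (B, L * n_prev, D)
            cross, _ = self.cross_layer_attn(hidden, memory, memory)
            g = self.gate(torch.cat([hidden, cross], dim=-1))   # adaptive mixing weight
            hidden = self.norm(hidden + g * cross)
        return hidden


class ToyCrossModalEncoder(nn.Module):
    def __init__(self, n_layers: int = 4, d_model: int = 768):
        super().__init__()
        self.layers = nn.ModuleList(
            [CrossLayerFusionLayer(d_model) for _ in range(n_layers)]
        )

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        hidden = torch.cat([img_tokens, txt_tokens], dim=1)     # single-stream fusion
        history: list[torch.Tensor] = []
        for layer in self.layers:
            hidden = layer(hidden, history)
            history.append(hidden)                              # expose every level
        return hidden


if __name__ == "__main__":
    enc = ToyCrossModalEncoder()
    img = torch.randn(2, 49, 768)   # e.g. 7x7 patch features
    txt = torch.randn(2, 16, 768)   # token embeddings
    print(enc(img, txt).shape)      # torch.Size([2, 65, 768])
```

Here the "adaptive" connection is approximated by a sigmoid gate over the concatenated in-layer and cross-layer features, so that each layer decides how much information to pull from earlier levels; the actual mechanism in the paper may differ.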
Related papers
- COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models [14.130327598928778]
A framework combining large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs) is proposed.
Our framework generates realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods.
Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics and computer vision.
arXiv Detail & Related papers (2024-09-30T17:02:13Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Switch-BERT: Learning to Model Multimodal Interactions by Switching
Attention and Input [27.102030262319197]
We present Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch.
Switch-BERT extends BERT architecture by introducing learnable layer-wise and cross-layer interactions.
Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently achieves better or comparable performance.
arXiv Detail & Related papers (2023-06-25T09:28:40Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - MultiViz: An Analysis Benchmark for Visualizing and Understanding
Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z) - Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal
Sentiment Analysis [18.4364234071951]
We propose a novel framework HyCon for hybrid contrastive learning of tri-modal representation.
Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning.
Our proposed method outperforms existing works.
arXiv Detail & Related papers (2021-09-04T06:04:21Z)