DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention
- URL: http://arxiv.org/abs/2209.03126v1
- Date: Wed, 7 Sep 2022 13:25:09 GMT
- Title: DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention
- Authors: Shunsuke Kitada, Yuki Iwazaki, Riku Togashi, Hitoshi Iyatomi
- Abstract summary: Methods for extracting important information from multimodal data rely on a mid-fusion architecture.
We propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets.
Our concept exhibits performance that is comparable to or better than the previous set-aware models.
- Score: 8.382710169577447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is increasing interest in the use of multimodal data in various web
applications, such as digital advertising and e-commerce. Typical methods for
extracting important information from multimodal data rely on a mid-fusion
architecture that combines the feature representations from multiple encoders.
However, as the number of modalities increases, several potential problems with
the mid-fusion model structure arise, such as an increase in the dimensionality
of the concatenated multimodal features and missing modalities. To address
these problems, we propose a new concept that considers multimodal inputs as a
set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our
set-aware concept consists of three components that capture the relationships
among multiple modalities: (a) a BERT-based encoder to handle the inter- and
intra-order of elements in the sequences, (b) intra-modality residual attention
(IntraMRA) to capture the importance of the elements in a modality, and (c)
inter-modality residual attention (InterMRA) to enhance the importance of
elements with modality-level granularity further. Our concept exhibits
performance that is comparable to or better than the previous set-aware models.
Furthermore, we demonstrate that the visualization of the learned InterMRA and
IntraMRA weights can provide an interpretation of the prediction results.
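Below is a minimal PyTorch sketch of the set-aware idea described in the abstract: each modality is treated as a token sequence, a shared encoder (a plain TransformerEncoder standing in for the BERT-based encoder) produces element embeddings, an IntraMRA-style residual attention re-weights elements within each modality, and an InterMRA-style residual attention re-weights whole modalities before a set-level prediction. The layer sizes, per-modality encoding, mean pooling, and the names ResidualAttention and DM2S2Sketch are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Scores items, normalizes with softmax, and re-weights them with a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n_items, dim); attention weights sum to 1 over the items axis.
        weights = torch.softmax(self.score(x), dim=-2)
        return x + weights * x  # residual re-weighting keeps the original representations


class DM2S2Sketch(nn.Module):
    def __init__(self, vocab_size: int = 30522, dim: int = 128, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for BERT
        self.intra_mra = ResidualAttention(dim)  # element level, within a modality
        self.inter_mra = ResidualAttention(dim)  # modality level, across modalities
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modality_sequences: list[torch.Tensor]) -> torch.Tensor:
        # modality_sequences: one (batch, seq_len) tensor of token ids per modality;
        # a missing modality is handled by simply leaving its sequence out of the set.
        modality_vectors = []
        for tokens in modality_sequences:
            h = self.encoder(self.embed(tokens))      # (batch, seq_len, dim)
            h = self.intra_mra(h)                     # weight elements in this modality
            modality_vectors.append(h.mean(dim=1))    # pool to one vector per modality
        m = torch.stack(modality_vectors, dim=1)      # (batch, n_modalities, dim)
        m = self.inter_mra(m)                         # weight the modalities themselves
        return self.classifier(m.mean(dim=1))         # set-level prediction


# Usage: two modalities (e.g., ad text tokens and image-tag tokens) of different lengths.
model = DM2S2Sketch()
text = torch.randint(0, 30522, (8, 20))
tags = torch.randint(0, 30522, (8, 5))
logits = model([text, tags])  # shape: (8, 2)
```

Because the residual attention only re-weights rather than replaces the element and modality representations, the learned IntraMRA and InterMRA weights can be read off directly, which is what makes the visualization-based interpretation mentioned in the abstract possible.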
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF).
MoPE-BAF is a novel multi-modal soft prompt framework based on a unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short of comprehending contexts involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Adversarial Multimodal Representation Learning for Click-Through Rate Prediction [16.10640369157054]
We propose a novel Multimodal Adversarial Representation Network (MARN) for the Click-Through Rate (CTR) prediction task.
A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features.
A multimodal adversarial network learns modality-invariant representations, where a double-discriminators strategy is introduced.
arXiv Detail & Related papers (2020-03-07T15:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.