Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos
- URL: http://arxiv.org/abs/2206.07981v2
- Date: Fri, 17 Jun 2022 02:58:20 GMT
- Title: Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos
- Authors: Lianyang Ma, Yu Yao, Tao Liang, Tongliang Liu
- Abstract summary: We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
- Score: 58.93586436289648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentiment analysis in videos is a key task in many real-world
applications, which usually requires integrating multimodal streams including
visual, verbal and acoustic behaviors. To improve the robustness of multimodal
fusion, some of the existing methods let different modalities communicate with
each other and model the crossmodal interactions via transformers. However,
these methods use only single-scale representations during the interaction
and fail to exploit multi-scale representations that contain different levels
of semantic information. As a result, the representations learned by the
transformers can be biased, especially for unaligned multimodal data. In this
paper, we propose a multi-scale cooperative multimodal transformer (MCMulT)
architecture for multimodal sentiment analysis. Overall, the "multi-scale"
mechanism exploits the different levels of semantic information in each
modality, which are then used for fine-grained crossmodal interactions.
Meanwhile, each modality learns its feature hierarchy by integrating the
crossmodal interactions from the multi-level features of its source modality. In
this way, each pair of modalities progressively builds its feature hierarchies
in a cooperative manner. The empirical results show that our
MCMulT model not only outperforms existing approaches on unaligned multimodal
sequences but also has strong performance on aligned multimodal sequences.
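The paper itself does not include code, but the core idea, letting one modality attend to several scales of another modality's features and fusing the results, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (module names, dimensions, and the fusion layer are placeholders), not the authors' MCMulT implementation.

```python
import torch
import torch.nn as nn


class MultiScaleCrossmodalAttention(nn.Module):
    """A target modality queries several scales of a source modality."""

    def __init__(self, dim: int = 64, num_heads: int = 4, num_scales: int = 3):
        super().__init__()
        # One crossmodal attention block per scale of the source modality.
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_scales)]
        )
        # Project the concatenated per-scale outputs back to the model dimension.
        self.fuse = nn.Linear(num_scales * dim, dim)

    def forward(self, target, source_scales):
        # target:        (batch, T_tgt, dim), e.g. text tokens
        # source_scales: list of (batch, T_s, dim) tensors, coarse to fine
        outputs = []
        for attn, src in zip(self.attn, source_scales):
            # The target sequence attends to one scale of the source; no
            # word-level alignment between the two sequences is required.
            out, _ = attn(query=target, key=src, value=src)
            outputs.append(out)
        return self.fuse(torch.cat(outputs, dim=-1))


if __name__ == "__main__":
    text = torch.randn(2, 20, 64)                              # target modality
    audio = [torch.randn(2, n, 64) for n in (80, 40, 20)]      # source at 3 scales
    print(MultiScaleCrossmodalAttention()(text, audio).shape)  # torch.Size([2, 20, 64])
```

Because the queries come from the target modality and the keys and values from each source scale, the sketch handles unaligned sequence lengths naturally, which is the setting the abstract emphasizes.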
Related papers
- CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation [8.874033487493913]
Multimodal emotion recognition in conversation aims to accurately identify emotions in conversational utterances.
We propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components.
Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-15T09:23:02Z)
- What to align in multimodal contrastive learning? [7.7439394183358745]
We introduce CoMM, a Contrastive MultiModal learning strategy that enables communication between modalities in a single multimodal space.
Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy.
CoMM learns complex multimodal interactions and achieves state-of-the-art results on six multimodal benchmarks.
arXiv Detail & Related papers (2024-09-11T16:42:22Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships among image features, section features, and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer).
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
First, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder (see the sketch after this list).
Second, we propose a bidirectional multi-layer fusion scheme in which multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
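As a companion to the asymmetric multi-layer fusion entry above, here is a minimal, hypothetical PyTorch sketch of the modality-specific batch-normalization idea: convolution weights are shared across modalities while each modality keeps its own BatchNorm layers. Module names, channel sizes, and the naive additive fusion at the end are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn


class SharedEncoderWithModalityBN(nn.Module):
    """Shared conv weights; per-modality batch-norm statistics and affine params."""

    def __init__(self, in_channels: int = 3, hidden: int = 32,
                 modalities=("rgb", "depth")):
        super().__init__()
        # Convolution weights are shared by all modalities.
        self.conv1 = nn.Conv2d(in_channels, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)
        # Each modality keeps its own batch-norm layers.
        self.bn1 = nn.ModuleDict({m: nn.BatchNorm2d(hidden) for m in modalities})
        self.bn2 = nn.ModuleDict({m: nn.BatchNorm2d(hidden) for m in modalities})
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, modality: str):
        x = self.act(self.bn1[modality](self.conv1(x)))
        x = self.act(self.bn2[modality](self.conv2(x)))
        return x


if __name__ == "__main__":
    enc = SharedEncoderWithModalityBN()
    rgb = torch.randn(2, 3, 64, 64)
    depth = torch.randn(2, 3, 64, 64)              # depth replicated to 3 channels here
    fused = enc(rgb, "rgb") + enc(depth, "depth")  # naive additive fusion for illustration
    print(fused.shape)                             # torch.Size([2, 32, 64, 64])
```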