An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2510.23617v1
- Date: Mon, 20 Oct 2025 16:29:46 GMT
- Title: An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis
- Authors: Phuong Q. Dao, Mark Roantree, Vuong M. Ngo
- Abstract summary: We first propose BERT-ViT-EF, a novel model that combines the powerful Transformer-based encoders BERT (for textual input) and ViT (for visual input) through an early fusion strategy. To further enhance the model's capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN).
- Score: 1.0399530974344653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities, typically text and images, offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders, BERT for textual input and ViT for visual input, through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model's capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature learning. Empirical results on two widely used MSA benchmarks, MVSA-Single and TumEmo, demonstrate the effectiveness of our approach. DTCN achieves the best accuracy (78.4%) and F1-score (78.3%) on TumEmo, and delivers competitive performance on MVSA-Single, with 76.6% accuracy and 75.9% F1-score. These improvements highlight the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.
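The two key ingredients of the abstract, early fusion of text and image embeddings and a contrastive objective that aligns the two modalities, can be sketched in plain Python. The function names, the symmetric InfoNCE formulation, and the temperature value below are illustrative assumptions, not the authors' exact implementation; real encoders (BERT, ViT) are replaced by precomputed embedding vectors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (contrastive losses compare directions)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def early_fusion(text_emb, image_emb):
    """Early fusion: concatenate modality embeddings so the joint encoder
    sees both modalities at once (here, simple list concatenation)."""
    return text_emb + image_emb

def info_nce_loss(text_batch, image_batch, temperature=0.07):
    """Symmetric InfoNCE: each matched (text, image) pair is a positive,
    every other pairing in the batch is a negative."""
    t = [l2_normalize(v) for v in text_batch]
    v = [l2_normalize(u) for u in image_batch]
    n = len(t)
    # Cosine-similarity logits, scaled by temperature.
    logits = [[dot(t[i], v[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        m = max(row)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    loss_t2i = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_i2t = sum(cross_entropy([logits[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return 0.5 * (loss_t2i + loss_i2t)
```

In a full model, the loss drops as matched text/image pairs move closer in the shared space: a batch of perfectly aligned pairs yields a near-zero loss, while mismatched pairs yield a large one.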
Related papers
- Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models [0.0]
This project performs multimodal sentiment analysis using the CMU-MOSEI dataset. We use transformer-based models with early fusion to integrate text, audio, and visual modalities. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set.
arXiv Detail & Related papers (2025-05-09T15:10:57Z)
- Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach [2.859032340781147]
This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments. Experiments on three datasets (Memotion 7k, MVSA-single, and MVSA-multi) demonstrate the viability and practicality of the proposed multimodal architecture.
arXiv Detail & Related papers (2025-03-11T00:53:45Z)
- Multimodal Sentiment Analysis Based on BERT and ResNet [0.0]
A multimodal sentiment analysis framework combining BERT and ResNet is proposed. BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in computer vision. Experimental results on the public MVSA-single dataset show that, compared with single-modal models using only BERT or ResNet, the proposed multimodal model improves both accuracy and F1-score, reaching a best accuracy of 74.5%.
arXiv Detail & Related papers (2024-12-04T15:55:20Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS)
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product [4.528221075598755]
This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy.
It combines BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%.
arXiv Detail & Related papers (2024-03-13T13:16:26Z)
- WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced multimodal sentiment analysis.
We show that our approach has substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we find that this approach yields efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis [42.6733747726081]
We propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis.
We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, and IEMOCAP.
arXiv Detail & Related papers (2020-09-07T06:11:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.