A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product
- URL: http://arxiv.org/abs/2403.08511v4
- Date: Sat, 16 Nov 2024 02:38:47 GMT
- Title: A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product
- Authors: Ao Xiang, Zongqing Qi, Han Wang, Qin Yang, Danqing Ma
- Abstract summary: This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy.
It combines BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%.
- Score: 4.528221075598755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new multi-modal model based on the Transformer architecture and a tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors to classify students' psychological conditions with an accuracy of 93.65%. The purpose of the study is to accurately analyze students' mental health status from various data sources. The paper discusses modal fusion methods, including early, late, and intermediate fusion, to overcome the challenges of integrating multi-modal information. Ablation studies compare different models and fusion techniques, showing that the proposed model outperforms existing methods such as CLIP and ViLBERT in both accuracy and inference speed. The conclusions indicate that while the model offers significant advantages for emotion recognition, incorporating additional data modalities remains an area for future research.
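Read against the fusion taxonomy above, the tensor-product strategy is an intermediate fusion: learned representations are combined before classification, rather than raw inputs (early fusion) or final predictions (late fusion). A minimal PyTorch sketch of the idea follows; the 768-dimensional embeddings and 4-way classifier head are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of tensor-product fusion, assuming 768-dim BERT and ViT
# pooled embeddings and a 4-way classifier head (both are assumptions).
import torch
import torch.nn as nn

class TensorProductFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, num_classes=4):
        super().__init__()
        # Project the flattened outer product down to class logits.
        self.classifier = nn.Linear(text_dim * image_dim, num_classes)

    def forward(self, text_vec, image_vec):
        # The outer (tensor) product captures every pairwise interaction
        # between text and image features: (B, T) x (B, I) -> (B, T, I).
        fused = torch.einsum("bt,bi->bti", text_vec, image_vec)
        return self.classifier(fused.flatten(start_dim=1))

# Stand-ins for BERT and ViT pooled outputs:
logits = TensorProductFusion()(torch.randn(2, 768), torch.randn(2, 768))
```

Note that the full 768x768 interaction map has over half a million entries per sample, so practical variants often project the outer product to a lower rank before classification.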
Related papers
- Multimodal Magic Elevating Depression Detection with a Fusion of Text and Audio Intelligence [4.92323103166693]
This study proposes an innovative multimodal fusion model based on a teacher-student architecture to enhance the accuracy of depression classification.
The designed model addresses the limitations of traditional methods in feature fusion and modality weight allocation by introducing multi-head attention mechanisms and weighted multimodal transfer learning.
Ablation experiments demonstrate that the proposed model attains an F1 score of 99.1% on the test set, significantly outperforming unimodal and conventional approaches.
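As a rough illustration of how multi-head attention and modality weighting could be combined in such a model, consider the sketch below; the text/audio modality pair, dimensions, and scalar weighting scheme are assumptions for illustration, not the paper's design.

```python
# A hedged sketch of weighted multimodal fusion with multi-head attention;
# modalities, dimensions, and the weighting scheme are illustrative only.
import torch
import torch.nn as nn

class WeightedAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learnable logits, softmaxed so modality weights sum to one.
        self.modality_logits = nn.Parameter(torch.zeros(2))

    def forward(self, text_seq, audio_seq):
        # Text tokens attend to audio tokens (cross-modal attention).
        attended, _ = self.attn(text_seq, audio_seq, audio_seq)
        w = torch.softmax(self.modality_logits, dim=0)
        # Weighted sum of the pooled text and attended-audio streams.
        return w[0] * text_seq.mean(dim=1) + w[1] * attended.mean(dim=1)

fused = WeightedAttentionFusion()(torch.randn(2, 10, 256), torch.randn(2, 20, 256))
```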
arXiv Detail & Related papers (2025-01-28T09:30:29Z)
- Analyzing Persuasive Strategies in Meme Texts: A Fusion of Language Models with Paraphrase Enrichment [0.23020018305241333]
This paper describes our approach to hierarchical multi-label detection of persuasion techniques in meme texts.
The scope of the study encompasses enhancing model performance through innovative training techniques and data augmentation strategies.
arXiv Detail & Related papers (2024-07-01T20:25:20Z)
- FusionBench: A Comprehensive Benchmark of Deep Model Fusion [78.80920533793595]
Deep model fusion is a technique that unifies the predictions or parameters of several deep neural networks into a single model.
FusionBench is the first comprehensive benchmark dedicated to deep model fusion.
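The two fusion families the entry names, prediction-level (ensembling) and parameter-level (weight merging), differ in what gets combined. The sketch below shows the simplest parameter-level case, uniform weight averaging of identically structured checkpoints; FusionBench itself covers many more sophisticated methods.

```python
# A minimal sketch of parameter-level model fusion via uniform weight
# averaging of checkpoints that share one architecture.
import torch

def average_state_dicts(state_dicts):
    """Average each parameter tensor across a list of state dicts."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical usage: fuse two fine-tuned checkpoints of the same model.
# fused_model.load_state_dict(average_state_dicts([ckpt_a, ckpt_b]))
```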
arXiv Detail & Related papers (2024-06-05T13:54:28Z)
- Application of Multimodal Fusion Deep Learning Model in Disease Recognition [14.655086303102575]
This paper introduces an innovative multi-modal fusion deep learning approach to overcome the drawbacks of traditional single-modal recognition techniques.
During the feature extraction stage, cutting-edge deep learning models are applied to distill advanced features from image-based, temporal, and structured data sources.
The findings demonstrate significant advantages of the multimodal fusion model across multiple evaluation metrics.
arXiv Detail & Related papers (2024-05-22T23:09:49Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
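As a generic sketch of cross-attention fusion in this spirit (not the JMT architecture itself), each modality can query the other before the attended streams are pooled and concatenated; the audio/video pair and all dimensions below are assumptions.

```python
# A hedged sketch of two-way cross-modal attention fusion; the JMT paper's
# key-based variant and exact layout may differ from this generic form.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a_to_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):
        # Audio queries video features, and vice versa.
        a_att, _ = self.a_to_v(audio, video, video)
        v_att, _ = self.v_to_a(video, audio, audio)
        # Pool each attended stream and concatenate into a joint representation.
        return torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)

joint = JointCrossAttention()(torch.randn(2, 50, 128), torch.randn(2, 16, 128))
```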
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Coupled generator decomposition for fusion of electro- and magnetoencephalography data [1.7102695043811291]
Data fusion modeling can identify common features across diverse data sources while accounting for source-specific variability.
We introduce the concept of a coupled generator decomposition and demonstrate how it generalizes sparse principal component analysis for data fusion.
arXiv Detail & Related papers (2024-03-02T12:09:16Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
- Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models and the reported performance of similar models on our specific task.
We experimented with multiple fusion techniques and found that the best-performing way to combine the individual embeddings of the unimodal networks is a combination of concatenation and averaging of the feature vectors.
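One plausible reading of combining concatenation and averaging is to keep both views of the unimodal embeddings side by side, as sketched below; this interpretation, and the dimensions, are assumptions rather than the paper's confirmed recipe.

```python
# A small sketch of fusing unimodal embeddings by concatenation plus
# averaging; the exact combination used in the paper is an assumption here.
import torch

def concat_and_average(text_emb, image_emb):
    concatenated = torch.cat([text_emb, image_emb], dim=-1)  # (B, 2D)
    averaged = (text_emb + image_emb) / 2                    # (B, D)
    return torch.cat([concatenated, averaged], dim=-1)       # (B, 3D)

fused = concat_and_average(torch.randn(4, 512), torch.randn(4, 512))  # (4, 1536)
```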
arXiv Detail & Related papers (2022-07-07T14:04:42Z)
- Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis [42.6733747726081]
We propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis.
We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP.
arXiv Detail & Related papers (2020-09-07T06:11:56Z)