Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis
- URL: http://arxiv.org/abs/2508.13196v1
- Date: Fri, 15 Aug 2025 21:34:13 GMT
- Title: Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis
- Authors: Meriem Zerkouk, Miloud Mihoubi, Belkacem Chikhaoui
- Abstract summary: This paper introduces a novel approach for multimodal sentiment analysis on social media, particularly in the context of natural disasters. Unlike conventional methods that process text and image modalities separately, our approach seamlessly integrates CNN-based image analysis with Large Language Model-based text processing. Our model achieves a notable 2.43% increase in accuracy and 5.18% in F1-score, highlighting its efficacy in processing complex multimodal data.
- Score: 0.4369550829556578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel approach for multimodal sentiment analysis on social media, particularly in the context of natural disasters, where understanding public sentiment is crucial for effective crisis management. Unlike conventional methods that process text and image modalities separately, our approach seamlessly integrates Convolutional Neural Network (CNN)-based image analysis with Large Language Model (LLM)-based text processing, leveraging Generative Pre-trained Transformer (GPT) models and prompt engineering to extract sentiment-relevant features from the CrisisMMD dataset. To effectively model intermodal relationships, we introduce a contextual attention mechanism within the fusion process. Its contextual-attention layers capture intermodality interactions, enhancing the model's comprehension of complex relationships between textual and visual data. The deep neural network architecture of our model learns from these fused features, leading to improved accuracy compared to existing baselines. Experimental results demonstrate significant advancements in classifying social media data into informative and non-informative categories across various natural disasters. Our model achieves a notable 2.43% increase in accuracy and 5.18% in F1-score, highlighting its efficacy in processing complex multimodal data. Beyond quantitative metrics, our approach provides deeper insight into the sentiments expressed during crises. The practical implications extend to real-time disaster management, where enhanced sentiment analysis can improve the accuracy of emergency interventions. By bridging the gap between multimodal analysis, LLM-powered text understanding, and disaster response, our work presents a promising direction for Artificial Intelligence (AI)-driven crisis management solutions.
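The abstract does not include code, but the fusion idea is concrete enough to sketch. Below is a minimal PyTorch illustration of contextual-attention fusion in this spirit: a pooled LLM text embedding queries CNN feature-map regions via cross-attention, and the fused vector drives the informative/non-informative classifier. All dimensions, the use of `nn.MultiheadAttention`, and the pooling scheme are assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of contextual-attention fusion:
# LLM-derived text embeddings attend over CNN feature-map regions, and the
# fused representation feeds a small classifier.
import torch
import torch.nn as nn

class ContextualAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, img_dim=512, fused_dim=256, n_heads=4, n_classes=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)   # project LLM embedding
        self.img_proj = nn.Linear(img_dim, fused_dim)     # project CNN region features
        self.cross_attn = nn.MultiheadAttention(fused_dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim * 2, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, n_classes),              # informative vs. non-informative
        )

    def forward(self, text_emb, img_regions):
        # text_emb: (B, text_dim) pooled LLM embedding of the post
        # img_regions: (B, R, img_dim) CNN feature map flattened to R regions
        q = self.text_proj(text_emb).unsqueeze(1)         # (B, 1, fused_dim) query
        kv = self.img_proj(img_regions)                   # (B, R, fused_dim) keys/values
        attended, _ = self.cross_attn(q, kv, kv)          # text attends over image regions
        fused = torch.cat([q.squeeze(1), attended.squeeze(1)], dim=-1)
        return self.classifier(fused)

model = ContextualAttentionFusion()
logits = model(torch.randn(8, 768), torch.randn(8, 49, 512))  # e.g. a 7x7 CNN grid
print(logits.shape)  # torch.Size([8, 2])
```

Letting the text act as the query is one natural reading of "contextual" attention here: the language context decides which image regions matter for sentiment.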
Related papers
- See, Think, Act: Online Shopper Behavior Simulation with VLM Agents [58.92444959954643]
This paper investigates the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs. We employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor.
arXiv Detail & Related papers (2025-10-22T05:07:14Z)
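As a rough illustration of the hierarchical, difficulty-scaled reward mentioned above, here is a hypothetical reward function; the two-level action/rationale hierarchy, the weights, and the scaling formula are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch of a difficulty-aware reward for RL fine-tuning.
def scaled_reward(action_correct: bool, rationale_score: float,
                  task_difficulty: float, w_action: float = 1.0,
                  w_rationale: float = 0.5) -> float:
    """Hierarchical reward: an action-level term plus a rationale-level term,
    scaled so that harder tasks (difficulty in [0, 1]) earn more credit."""
    base = w_action * float(action_correct) + w_rationale * rationale_score
    return base * (1.0 + task_difficulty)  # difficulty-aware scaling factor

print(scaled_reward(True, 0.8, task_difficulty=0.6))  # ~2.24
```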
- LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection [17.045049022252563]
The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. We propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations and news propagation dynamics.
arXiv Detail & Related papers (2025-08-26T08:58:35Z)
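The contrastive objective in this line of work is typically an InfoNCE-style loss over two views of the same item (e.g., an AMR graph and its masked reconstruction). A generic sketch, with encoders, masking strategy, and temperature all assumed rather than taken from the paper:

```python
# Generic InfoNCE contrastive loss over two embedding views, standing in for
# the paper's contrastive AMR objective.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    # z1, z2: (B, D) embeddings of two views (e.g., original vs. masked graph)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
print(float(loss))
```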
- Differential Attention for Multimodal Crisis Event Analysis [1.5030693386126894]
Social networks can be a valuable source of information during crisis events. We explore vision-language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data. Our results show that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy.
arXiv Detail & Related papers (2025-07-07T16:20:35Z)
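One common form of "adaptive fusion" is a learned gate that decides, per dimension, how much each modality contributes. A minimal sketch of that idea; the gating design is an assumption standing in for the paper's differential-attention formulation:

```python
# Gated (adaptive) fusion sketch: a sigmoid gate mixes the two modalities.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, text_feat, img_feat):
        # Per-dimension gate decides how much each modality contributes.
        g = self.gate(torch.cat([text_feat, img_feat], dim=-1))
        return g * text_feat + (1 - g) * img_feat

fused = GatedFusion()(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```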
- Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing [19.177541719713666]
Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text. We propose a novel approach for affective computing that explicitly decomposes visual and textual representations into shared (modality-invariant) and modality-specific components.
arXiv Detail & Related papers (2025-06-08T11:15:57Z)
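The shared/specific decomposition can be sketched as two projection heads per modality, with a similarity loss aligning the shared parts across modalities and an orthogonality penalty separating shared from specific parts. The projections and loss weights below are illustrative assumptions:

```python
# Sketch of decomposing each modality into shared and modality-specific parts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decomposer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # modality-invariant projection
        self.specific = nn.Linear(dim, dim)  # modality-specific projection

    def forward(self, x):
        return self.shared(x), self.specific(x)

def decomposition_loss(txt, img, dec_t, dec_v):
    t_sh, t_sp = dec_t(txt)
    v_sh, v_sp = dec_v(img)
    sim = 1 - F.cosine_similarity(t_sh, v_sh, dim=-1).mean()          # align shared parts
    orth = (F.cosine_similarity(t_sh, t_sp, dim=-1).pow(2).mean()
            + F.cosine_similarity(v_sh, v_sp, dim=-1).pow(2).mean())  # separate specific parts
    return sim + 0.1 * orth

loss = decomposition_loss(torch.randn(8, 256), torch.randn(8, 256),
                          Decomposer(), Decomposer())
print(float(loss))
```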
- Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems [58.95962217043371]
We present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. We propose a novel topology design approach, EIB-learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs.
arXiv Detail & Related papers (2025-05-29T11:21:48Z)
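A toy simulation makes the sparsity effect tangible: seed a few erroneous agents in a random communication graph and let each agent adopt its neighbors' majority answer. This is a caricature of the paper's causal framework, with every modeling choice assumed:

```python
# Toy error-propagation simulation over agent graphs of varying density.
import random

def error_rate_after_rounds(n=30, edge_prob=0.2, n_wrong=3, rounds=5, seed=0):
    rng = random.Random(seed)
    adj = [[j for j in range(n) if j != i and rng.random() < edge_prob]
           for i in range(n)]
    state = [1] * n                       # 1 = correct output, 0 = erroneous
    for i in rng.sample(range(n), n_wrong):
        state[i] = 0
    for _ in range(rounds):               # synchronous majority update
        state = [round(sum(state[j] for j in adj[i]) / len(adj[i]))
                 if adj[i] else state[i] for i in range(n)]
    return 1 - sum(state) / n

for p in (0.1, 0.3, 0.9):                 # sparse -> dense topologies
    print(f"edge_prob={p}: error rate {error_rate_after_rounds(edge_prob=p):.2f}")
```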
- Contextual Reinforcement in Multimodal Token Compression for Large Language Models [0.0]
Token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation.
arXiv Detail & Related papers (2025-01-28T02:44:31Z)
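Importance-based token pruning can be sketched by scoring each token by the attention it receives from the rest of the sequence and keeping the top-k. The scoring rule and keep ratio below are assumptions, not the paper's contextual-reinforcement mechanism:

```python
# Sketch of importance-based token pruning.
import torch

def prune_tokens(token_emb, keep_ratio=0.5):
    # token_emb: (T, D). Importance = average attention each token receives,
    # a crude proxy for contextual interdependence.
    sim = torch.softmax(token_emb @ token_emb.t() / token_emb.size(-1) ** 0.5, dim=-1)
    importance = sim.mean(dim=0)
    k = max(1, int(keep_ratio * token_emb.size(0)))
    keep = importance.topk(k).indices.sort().values   # preserve original order
    return token_emb[keep], keep

kept, idx = prune_tokens(torch.randn(12, 64), keep_ratio=0.5)
print(kept.shape, idx.tolist())
```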
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
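A drastically simplified stand-in for root-cause localization: rank services by PageRank over a service dependency graph, so that upstream components many failing services depend on score highest. OCEAN learns the causal structure online; here the edges are simply assumed:

```python
# Toy root-cause ranking via PageRank on an assumed dependency graph.
import networkx as nx

# Edges point from a dependent service to the service it calls, so the random
# walk drifts toward upstream components that many failures trace back to.
deps = [("frontend", "cart"), ("frontend", "catalog"),
        ("cart", "db"), ("catalog", "db")]
g = nx.DiGraph(deps)
scores = nx.pagerank(g)
for svc, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{svc}: {s:.3f}")   # 'db' surfaces as the likeliest root cause
```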
- GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis [2.012311338995539]
This paper presents a novel framework that leverages multi-modal contextual information from utterances and applies metaheuristic algorithms to learn utterance-level sentiment and emotion prediction.
To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multimodal benchmark datasets.
arXiv Detail & Related papers (2024-10-02T10:07:48Z)
- WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM that leverages contextual world knowledge induced from large vision-language models (LVLMs) for enhanced multimodal sentiment analysis.
We show that our approach achieves substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
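The filtering cross-attention idea in the entry above can be sketched as standard cross-attention from the strong modality onto the weak one, followed by a learned gate that down-weights unhelpful attended signal. Module shapes and the residual gating are assumptions, not the paper's exact design:

```python
# Sketch of cross-attention with a gate filtering weak-modality signal.
import torch
import torch.nn as nn

class FilteringCrossAttention(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, strong, weak):
        # strong: (B, S, D) trusted modality; weak: (B, W, D) noisy modality.
        attended, _ = self.attn(strong, weak, weak)   # strong queries weak
        g = self.gate(attended)                       # (B, S, 1) per-token gate
        return strong + g * attended                  # suppress unhelpful signal

out = FilteringCrossAttention()(torch.randn(2, 5, 256), torch.randn(2, 9, 256))
print(out.shape)  # torch.Size([2, 5, 256])
```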