Related papers: Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition

Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition

URL: http://arxiv.org/abs/2602.00360v1
Date: Fri, 30 Jan 2026 22:17:29 GMT
Title: Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition
Authors: Sumana Biswas, Karen Young, Josephine Griffith,
Abstract summary: Multimodal sentiment analysis includes both image and text data.<n>Part of the approach introduces the novel Textual-Cues for Enhancing Multimodal Sentiment Analysis' (TEMSA) based on object recognition methods.
Score: 0.45880283710344055
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Multimodal sentiment analysis, which includes both image and text data, presents several challenges due to the dissimilarities in the modalities of text and image, the ambiguity of sentiment, and the complexities of contextual meaning. In this work, we experiment with finding the sentiments of image and text data, individually and in combination, on two datasets. Part of the approach introduces the novel `Textual-Cues for Enhancing Multimodal Sentiment Analysis' (TEMSA) based on object recognition methods to address the difficulties in multimodal sentiment analysis. Specifically, we extract the names of all objects detected in an image and combine them with associated text; we call this combination of text and image data TEMS. Our results demonstrate that only TEMS improves the results when considering all the object names for the overall sentiment of multimodal data compared to individual analysis. This research contributes to advancing multimodal sentiment analysis and offers insights into the efficacy of TEMSA in combining image and text data for multimodal sentiment analysis.

Related papers

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.<n>We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach [2.859032340781147]
This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments.<n>Experiments on three datasets, Memotion 7k dataset, MVSA single dataset, and MVSA multi dataset, demonstrate the viability and practicality of the proposed multimodal architecture.
arXiv Detail & Related papers (2025-03-11T00:53:45Z)
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey [66.166184609616]
ChatGPT has opened up immense potential for applying large language models (LLMs) to text-centric multimodal tasks. It is still unclear how existing LLMs can adapt better to text-centric multimodal sentiment analysis tasks.
arXiv Detail & Related papers (2024-06-12T10:36:27Z)
Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances. Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis [16.472295458683696]
Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging. We introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE) This novel text-guided approach effectively captures multi-scale Whole Slide Images (WSIs) by utilizing information from accompanying textual pathology information.
arXiv Detail & Related papers (2024-05-28T04:47:44Z)
WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced multimodal sentiment analysis. We show that our approach has substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
Holistic Visual-Textual Sentiment Analysis with Prior Models [64.48229009396186]
We propose a holistic method that achieves robust visual-textual sentiment analysis. The proposed method consists of four parts: (1) a visual-textual branch to learn features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders to extract selected semantic visual features, (3) a CLIP branch to implicitly model visual-textual correspondence, and (4) a multimodal feature fusion network based on BERT to fuse multimodal features and make sentiment predictions.
arXiv Detail & Related papers (2022-11-23T14:40:51Z)
An AutoML-based Approach to Multimodal Image Sentiment Analysis [1.0499611180329804]
We propose a method that combines both textual and image individual sentiment analysis into a final fused classification based on AutoML. Our method achieved state-of-the-art performance in the B-T4SA dataset, with 95.19% accuracy.
arXiv Detail & Related papers (2021-02-16T11:28:50Z)
Transformer-based Multi-Aspect Modeling for Multi-Aspect Multi-Sentiment Analysis [56.893393134328996]
We propose a novel Transformer-based Multi-aspect Modeling scheme (TMM), which can capture potential relations between multiple aspects and simultaneously detect the sentiment of all aspects in a sentence. Our method achieves noticeable improvements compared with strong baselines such as BERT and RoBERTa.
arXiv Detail & Related papers (2020-11-01T11:06:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.