Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues
- URL: http://arxiv.org/abs/2509.15540v1
- Date: Fri, 19 Sep 2025 02:49:47 GMT
- Title: Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues
- Authors: Wei Chen, Tongguan Wang, Feiyue Xue, Junkai Li, Hui Liu, Ying Sha
- Abstract summary: Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. We propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition. Low-resolution images are used to obtain global visual representations for cross-modal alignment, while high-resolution images are partitioned into sub-images and modeled with masked image modeling.
- Score: 13.756325086005369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. Multimodal learning has advanced sentiment and emotion recognition, but multimodal approaches specifically targeting human desire understanding remain underexplored. Moreover, existing methods in sentiment analysis predominantly emphasize verbal cues and overlook images as complementary non-verbal cues. To address these gaps, we propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition, which enforces mutual guidance between the text and image modalities to effectively capture intention-related representations in the image. Specifically, low-resolution images are used to obtain global visual representations for cross-modal alignment, while high-resolution images are partitioned into sub-images and modeled with masked image modeling to enhance the ability to capture fine-grained local features. A text-guided image decoder and an image-guided text decoder are introduced to facilitate deep cross-modal interaction over both local and global image representations. Additionally, to balance perceptual gains against computational cost, a mixed-scale image strategy is adopted in which high-resolution images are cropped into sub-images for masked modeling. The proposed approach is evaluated on MSED, a multimodal dataset that includes a desire understanding benchmark as well as emotion and sentiment recognition. Experimental results show consistent improvements over state-of-the-art methods, validating the effectiveness of the proposed approach: F1-score gains of 1.1% in desire understanding, 0.6% in emotion recognition, and 0.9% in sentiment analysis. Our code is available at: https://github.com/especiallyW/SyDES.
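To make the mixed-scale strategy concrete, here is a minimal PyTorch sketch of the image pre-processing it describes: a downsampled global view for cross-modal alignment, plus high-resolution sub-images with random masking for masked image modeling. All function names, resolutions, and the masking scheme are illustrative assumptions; the authors' actual implementation lives in the linked repository.

```python
# Illustrative sketch of the mixed-scale image strategy from the abstract.
# Names, sizes, and the masking scheme are assumptions, not the released code.
import torch
import torch.nn.functional as F

def make_views(image, low_res=112, patch=112):
    """Build a low-res global view and high-res sub-images from one image."""
    # image: (C, H, W) with H and W divisible by `patch`
    low = F.interpolate(image.unsqueeze(0), size=(low_res, low_res),
                        mode="bilinear", align_corners=False)[0]
    c, h, w = image.shape
    subs = (image.unfold(1, patch, patch)        # split rows into patches
                 .unfold(2, patch, patch)        # split cols into patches
                 .permute(1, 2, 0, 3, 4)         # (nH, nW, C, patch, patch)
                 .reshape(-1, c, patch, patch))  # (num_subs, C, patch, patch)
    return low, subs

def mask_sub_images(subs, mask_ratio=0.5):
    """Randomly mask whole sub-images, in the spirit of masked image modeling."""
    n = subs.shape[0]
    masked_idx = torch.randperm(n)[:int(n * mask_ratio)]
    corrupted = subs.clone()
    corrupted[masked_idx] = 0.0   # zero out the masked sub-images
    return corrupted, masked_idx

image = torch.randn(3, 448, 448)
low, subs = make_views(image)             # low: (3,112,112), subs: (16,3,112,112)
corrupted, masked_idx = mask_sub_images(subs)
print(low.shape, subs.shape, masked_idx.shape)
```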
Related papers
- A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations [15.703292627605304]
This paper presents a novel cross-modal rumor detection scheme based on contrastive learning. A scale-aware fusion network is designed to integrate the highly pertinent multi-scale image features with global text features. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection.
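As a rough illustration of the text-image contrastive alignment such schemes build on, here is a generic symmetric InfoNCE loss; this is a common building block, not this paper's exact formulation, and the temperature value is an arbitrary assumption.

```python
# Generic symmetric InfoNCE loss for text-image contrastive alignment.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (batch, dim); matched pairs share a row index
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(t.shape[0])      # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```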
arXiv Detail & Related papers (2025-08-15T01:13:50Z)
- Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, setting state-of-the-art results.
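A minimal sketch of what text-guided fusion can look like, assuming a simple per-channel gate predicted from the text embedding; the gating design is an assumption for illustration, not this paper's architecture.

```python
# Illustrative text-conditioned fusion of infrared and visible features.
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    def __init__(self, feat_dim=64, text_dim=256):
        super().__init__()
        # Predict a per-channel blending gate from the text embedding.
        self.gate = nn.Sequential(nn.Linear(text_dim, feat_dim), nn.Sigmoid())

    def forward(self, ir_feat, vis_feat, text_emb):
        # ir_feat, vis_feat: (B, C, H, W); text_emb: (B, text_dim)
        g = self.gate(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        return g * ir_feat + (1 - g) * vis_feat    # text decides the blend

fused = TextGuidedFusion()(torch.randn(2, 64, 32, 32),
                           torch.randn(2, 64, 32, 32),
                           torch.randn(2, 256))
```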
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations.
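For intuition, Gram matrices are one standard way to summarize stylistic information in CNN features; the sketch below illustrates only that general idea and is not StyleEDL's attention or graph module.

```python
# Gram matrices as a generic stylistic descriptor of CNN feature maps.
import torch

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C) channel co-activation statistics
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

style = gram_matrix(torch.randn(2, 64, 32, 32))  # (2, 64, 64)
```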
arXiv Detail & Related papers (2023-08-06T03:22:46Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- High-Level Context Representation for Emotion Recognition in Images [4.987022981158291]
We propose an approach for high-level context representation extraction from images.
The model relies on a single cue and a single encoding stream to correlate this representation with emotions.
Our approach is more efficient than previous models and can be easily deployed to address real-world problems related to emotion recognition.
arXiv Detail & Related papers (2023-05-05T13:20:41Z)
- Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data [15.676632465869346]
A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes.
The proposed system has achieved 83.29% accuracy for emotion recognition.
arXiv Detail & Related papers (2022-08-25T04:43:34Z)
- ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations [77.3590853897664]
This work presents a self-supervised method to learn dense semantically rich visual embeddings for images inspired by methods for learning word embeddings in NLP.
arXiv Detail & Related papers (2021-11-24T12:27:30Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
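A minimal sketch of a multi-stream design in this spirit, with separate face, body, and scene encoders fused by concatenation; the choice of streams and all sizes are assumptions, not the paper's architecture.

```python
# Illustrative multi-stream network: one small encoder per visual cue,
# features concatenated before classification.
import torch
import torch.nn as nn

class MultiStreamEmotionNet(nn.Module):
    def __init__(self, num_classes=7, dim=128):
        super().__init__()
        def stream():  # a tiny conv encoder per cue
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.face, self.body, self.scene = stream(), stream(), stream()
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, face, body, scene):
        feats = torch.cat([self.face(face), self.body(body),
                           self.scene(scene)], dim=-1)
        return self.head(feats)

logits = MultiStreamEmotionNet()(torch.randn(2, 3, 64, 64),
                                 torch.randn(2, 3, 64, 64),
                                 torch.randn(2, 3, 64, 64))
```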
arXiv Detail & Related papers (2021-05-16T17:31:59Z)
- Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions for pixel-wise saliency refinement.
Our method is simple yet effective, and it is the first attempt to treat salient object detection mainly as an object-level semantic re-ranking problem.
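A toy sketch of the two-stage idea, assuming region proposals with semantic scores and several off-the-shelf saliency maps as inputs; all inputs and the fusion-by-averaging choice are stand-ins, not this paper's pipeline.

```python
# Toy two-stage pipeline: rank candidate regions by a semantic score first,
# then refine pixels only inside the top regions by averaging several maps.
import torch

def rerank_then_refine(regions, scores, saliency_maps, top_k=2):
    # regions: list of (x0, y0, x1, y1); scores: (N,) semantic scores
    # saliency_maps: (M, H, W) maps from M off-the-shelf models
    order = torch.argsort(scores, descending=True)[:top_k]
    fused = saliency_maps.mean(dim=0)   # fuse models by simple averaging
    out = torch.zeros_like(fused)
    for i in order.tolist():            # keep only the top-ranked regions
        x0, y0, x1, y1 = regions[i]
        out[y0:y1, x0:x1] = fused[y0:y1, x0:x1]
    return out

maps = torch.rand(3, 64, 64)
result = rerank_then_refine([(0, 0, 32, 32), (32, 32, 64, 64)],
                            torch.tensor([0.9, 0.4]), maps)
```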
arXiv Detail & Related papers (2020-08-10T07:12:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.