Related papers: MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition

MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition

URL: http://arxiv.org/abs/2507.18929v1
Date: Fri, 25 Jul 2025 03:42:26 GMT
Title: MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition
Authors: Jian Chen, Yuxuan Hu, Haifeng Lu, Wei Wang, Min Yang, Chengming Li, Xiping Hu,
Abstract summary: We propose a novel multi-granularity hierarchical fusion transformer (MGHFT)<n>We first use Multimodal Large Language Models to interpret stickers.<n>Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding.
Score: 29.045940445247872
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on 2 public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.

Related papers

Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC [28.0227032445724]
Multi Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals.<n>We propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process.
arXiv Detail & Related papers (2025-08-06T19:43:58Z)
MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data.<n> Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion [55.34830989105704]
Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities.<n>We introduce textual semantics at two levels: the mask semantic level and the text semantic level.<n>We propose Textual Semantic Guidance for infrared and visible image fusion, which guides the image synthesis process.
arXiv Detail & Related papers (2025-06-20T03:53:07Z)
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought [72.93910800095757]
multimodal chain-of-thought (MCoT) improves performance and interpretability of large vision-language models (LVLMs)<n>We show that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format.<n>We also explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers.
arXiv Detail & Related papers (2025-05-21T13:29:58Z)
Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.<n>Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.<n>We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs [13.922091192207718]
This research aims to analyze the relationship among sentences, visuals, and emoticons. We have proposed a novel contrastive learning based multimodal architecture. The proposed model attained an accuracy of 91% and an MCC-score of 90% while assessing emoticons.
arXiv Detail & Related papers (2024-08-05T15:45:59Z)
Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT)<n>We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level.<n>We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition [18.571931295274975]
Multimodal emotion recognition aims to recognize emotions for each utterance of multiple modalities. Current graph-based methods fail to simultaneously depict global contextual features and local diverse uni-modal features in a dialogue. We propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful)
arXiv Detail & Related papers (2023-11-18T08:21:42Z)
Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
multimodal Emotion Recognition in Conversation (ERC) faces two problems. Deep emotion cues extraction was performed on modalities with strong representation ability. Feature filters were designed as multimodal prompt information for modalities with weak representation ability. MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM. TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image. TEAM introduces a multi-source semantic graph that comprehensively characterize the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z)
Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings. We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
Multilevel Transformer For Multimodal Emotion Recognition [6.0149102420697025]
We introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation. Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition.
arXiv Detail & Related papers (2022-10-26T10:31:24Z)
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge. MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.