Multi-Granular Multimodal Clue Fusion for Meme Understanding
- URL: http://arxiv.org/abs/2503.12560v1
- Date: Sun, 16 Mar 2025 16:16:53 GMT
- Title: Multi-Granular Multimodal Clue Fusion for Meme Understanding
- Authors: Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, Donghong Ji
- Abstract summary: The multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. We propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU.
- Score: 30.697862544992386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of fine-grained metaphorical visual clues and the neglect of the weak correlation between text and images. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for the offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for the metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.
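
The abstract mentions a bidirectional cross-modal attention mechanism but gives no implementation details. As a rough illustration only, the sketch below shows generic scaled dot-product attention applied in both directions (text attending to image objects, and image objects attending to text), with a residual sum. It is not the authors' MGMCF code: the function names, the toy feature vectors, and the residual connection are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query vector attends over keys_values.

    The same vectors serve as keys and values (no learned projections),
    which keeps the sketch short; a real model would learn projections.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended = [
            sum(w * k[j] for w, k in zip(weights, keys_values))
            for j in range(dim)
        ]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights


def bidirectional_cross_attention(text_feats, image_feats):
    """Hypothetical helper: run attention text->image and image->text,
    then add each attended context back to the original features."""
    text_ctx, text_to_image = cross_attention(text_feats, image_feats)
    image_ctx, image_to_text = cross_attention(image_feats, text_feats)
    text_out = [[a + b for a, b in zip(t, c)] for t, c in zip(text_feats, text_ctx)]
    image_out = [[a + b for a, b in zip(v, c)] for v, c in zip(image_feats, image_ctx)]
    return text_out, image_out, text_to_image, image_to_text


# Toy inputs (made up): 3 text-token vectors and 2 image-object vectors, dim 4.
text_feats = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5], [0.5, 0.5, 0.0, 0.0]]
image_feats = [[0.9, 0.1, 0.4, 0.0], [0.0, 0.8, 0.1, 0.6]]

text_out, image_out, text_to_image, image_to_text = bidirectional_cross_attention(
    text_feats, image_feats
)
print(len(text_out), len(text_out[0]))    # number of text tokens, feature dim
print(len(image_out), len(image_out[0]))  # number of image objects, feature dim
```

Each row of `text_to_image` is a probability distribution over image objects for one text token, and each row of `image_to_text` is a distribution over text tokens for one image object. When text and image are only weakly correlated, these weights tend to be closer to uniform, which is the situation the abstract says the model is designed to handle.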
Related papers
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment [21.36633828492347]
Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD). We introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality to achieve effective domain adaptation. We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
arXiv Detail & Related papers (2025-02-23T06:59:22Z)
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis [34.28164104577455]
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities.
Past research predominantly focused on improving representation learning techniques and feature fusion strategies.
We introduce a Text-oriented Cross-Attention Network (TCAN) emphasizing the predominant role of the text modality in MSA.
arXiv Detail & Related papers (2024-04-06T07:56:09Z)
- Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. We study the Cross-Domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z)
- WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced multimodal sentiment analysis.
We show that our approach has substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4).
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.