CAMU: Context Augmentation for Meme Understanding
- URL: http://arxiv.org/abs/2504.17902v1
- Date: Thu, 24 Apr 2025 19:27:55 GMT
- Title: CAMU: Context Augmentation for Meme Understanding
- Authors: Girish A. Koushik, Diptesh Kanojia, Helen Treharne, Aditya Joshi
- Abstract summary: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions. Our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, on par with the existing SoTA framework.
- Score: 9.49890289676001
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions, a caption-scoring neural network to emphasise hate-relevant content, and parameter-efficient fine-tuning of CLIP's text encoder for an improved multimodal understanding of memes. Experiments on publicly available hateful meme datasets show that simple projection layer fine-tuning yields modest gains, whereas selectively tuning deeper text encoder layers significantly boosts performance on all evaluation metrics. Moreover, our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, on par with the existing SoTA framework while being much more efficient, offering practical advantages in real-world scenarios that rely on fixed decision thresholds. CAMU also achieves the best F1-score of 0.673 on the MultiOFF dataset for offensive meme identification, demonstrating its generalisability. Additional analyses on benign confounders reveal that robust visual grounding and nuanced text representations are crucial for reliable hate and offence detection. We will publicly release CAMU along with the resultant models for further research. Disclaimer: This paper includes references to potentially disturbing, hateful, or offensive content due to the nature of the task.
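The abstract's contrast between projection-layer tuning and selectively tuning deeper text encoder layers can be illustrated with a short sketch. The snippet below is a minimal illustration using the HuggingFace CLIPModel, not the authors' released code; the checkpoint name and the number of unfrozen layers `k` are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): selectively unfreezing the
# deeper layers of CLIP's text encoder, as the abstract describes.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustrative checkpoint

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Option A ("projection layer fine-tuning"): train only the text projection.
for p in model.text_projection.parameters():
    p.requires_grad = True

# Option B ("selectively tuning deeper text encoder layers"): also unfreeze
# the last k transformer blocks of the text encoder.
k = 3  # illustrative; which depth works best is an empirical question
for layer in model.text_model.encoder.layers[-k:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```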
Related papers
- Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering [0.5587293092389789]
Hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. We propose a multimodal hate detection framework that integrates OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
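The stages named above (OCR, neutral captioning, retrieval, VQA) read naturally as a pipeline. The skeleton below is a hypothetical illustration of that flow, not the paper's code; `run_ocr`, `caption_image`, `retrieve_context`, and `ask_vqa` are placeholder callables for whichever components one plugs in.

```python
# Hypothetical pipeline skeleton for the OCR -> caption -> retrieval -> VQA
# flow described above. Each helper is a placeholder, not the paper's API.
from dataclasses import dataclass

@dataclass
class MemeAnalysis:
    ocr_text: str
    caption: str
    retrieved_context: list
    vqa_answers: dict

def analyse_meme(image, run_ocr, caption_image, retrieve_context, ask_vqa):
    """Run the stages in sequence and collect their outputs for a classifier."""
    ocr_text = run_ocr(image)                      # embedded text in the meme
    caption = caption_image(image)                 # neutral visual description
    context = retrieve_context(ocr_text, caption)  # RAG over a knowledge store
    questions = [                                  # illustrative probing questions
        "What symbols or references appear in the image?",
        "Does the text target a protected group?",
    ]
    answers = {q: ask_vqa(image, q, context) for q in questions}
    return MemeAnalysis(ocr_text, caption, context, answers)
```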
arXiv Detail & Related papers (2025-04-23T13:52:14Z) - TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
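As a rough illustration of how such objectives are typically combined, the sketch below sums image-image and text-text contrastive terms with reconstruction regularizers under made-up weights; it is a generic recipe, not TULIP's implementation.

```python
# Illustrative combination of the training signals named above (not TULIP's
# actual code): image-image / text-text contrastive terms plus reconstruction
# regularization, with made-up weights.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def tulip_style_loss(img_emb, img_emb_aug, txt_emb, txt_emb_aug,
                     img_recon_loss, txt_recon_loss, w_rec=0.1):
    loss = info_nce(img_emb, txt_emb)                   # standard image-text alignment
    loss = loss + info_nce(img_emb, img_emb_aug)        # image-image contrastive view
    loss = loss + info_nce(txt_emb, txt_emb_aug)        # text-text contrastive view
    loss = loss + w_rec * (img_recon_loss + txt_recon_loss)  # reconstruction regularizer
    return loss
```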
arXiv Detail & Related papers (2025-03-19T17:58:57Z) - Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and the Text Enrichment Module (TEM). AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes [8.42736066868944]
We propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes.
Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG), to be infused within a compact VLM framework.
Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines.
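A minimal sketch of the knowledge-infusion ingredient is shown below: pulling a small sub-graph of ConceptNet edges for salient meme tokens via ConceptNet's public REST API. How the triples are scored and fused into the compact VLM is the paper's contribution and is not reproduced here; the example tokens are made up.

```python
# Minimal sketch: fetch a small ConceptNet sub-graph for meme tokens via the
# public REST API. Filtering and infusion into a compact VLM are not shown.
import requests

def conceptnet_edges(term, limit=10):
    """Return (start, relation, end) triples for an English term."""
    url = f"http://api.conceptnet.io/c/en/{term.lower()}"
    data = requests.get(url, params={"limit": limit}, timeout=10).json()
    return [
        (e["start"]["label"], e["rel"]["label"], e["end"]["label"])
        for e in data.get("edges", [])
    ]

# Example: build a tiny sub-graph around tokens extracted from a meme.
tokens = ["immigrant", "border"]              # illustrative tokens
subgraph = {t: conceptnet_edges(t) for t in tokens}
```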
arXiv Detail & Related papers (2024-11-19T02:39:28Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
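The routing idea behind task-oriented experts can be illustrated with a toy mixture-of-experts layer. The sketch below is a generic top-k router over feed-forward experts, not NeKo's architecture; dimensions and expert count are arbitrary.

```python
# Toy mixture-of-experts layer illustrating routing to task-oriented experts
# (a generic sketch, not NeKo's architecture).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=3, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```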
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification [11.270267165348626]
We introduce a novel dataset, PrideMM, comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement.
We propose a novel framework, MemeCLIP, for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model.
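The general pattern of preserving pre-trained CLIP knowledge while learning a small task head can be sketched as follows: CLIP stays frozen and only a lightweight classifier over concatenated image and text features is trained. This is a generic illustration, not MemeCLIP's exact modules; the checkpoint and head sizes are assumptions.

```python
# Generic sketch of frozen-CLIP downstream learning: a small trainable head
# over concatenated image/text features (not MemeCLIP's exact modules).
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad = False               # keep pre-trained knowledge intact

head = nn.Sequential(nn.Linear(512 * 2, 256), nn.ReLU(), nn.Linear(256, 2))

def classify(images, texts):
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return head(torch.cat([img, txt], dim=-1))    # logits, e.g. hateful vs. not
```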
arXiv Detail & Related papers (2024-09-23T04:49:08Z) - HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes [8.97062933976566]
HateSieve is a framework designed to enhance the detection and segmentation of hateful elements in memes.
HateSieve features a novel Contrastive Meme Generator that creates semantically paired memes.
Empirical experiments on the Hateful Memes dataset show that HateSieve not only surpasses existing LMMs in performance with fewer trainable parameters but also offers a robust mechanism for precisely identifying and isolating hateful content.
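A generic contrastive objective over semantically paired memes, in the spirit of the summary above, might look like the following sketch; it is an illustration with a hinge margin, not HateSieve's actual loss.

```python
# Illustrative contrastive objective over semantically paired memes
# (original vs. generated counterpart); a generic sketch, not HateSieve's loss.
import torch
import torch.nn.functional as F

def paired_meme_contrastive(anchor, positive, negatives, margin=0.2):
    """Pull the paired (semantically matched) meme close, push others away."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1)          # (batch,)
    neg_sim = anchor @ negatives.t()               # (batch, n_neg)
    # hinge: each negative should be at least `margin` less similar than the pair
    return F.relu(margin + neg_sim - pos_sim.unsqueeze(1)).mean()
```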
arXiv Detail & Related papers (2024-08-11T14:56:06Z) - Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
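The core idea of error-minimizing noise can be sketched as a small inner loop that repeatedly lowers the training loss with respect to an image perturbation and keeps it within a small epsilon-ball. This is a generic illustration of the concept, not MEM's multi-step procedure; `loss_fn`, step sizes, and bounds are assumptions.

```python
# Sketch of error-minimizing noise in the spirit described above: take several
# gradient steps that *decrease* the training loss w.r.t. an image perturbation,
# then clamp it to a small epsilon-ball. Generic illustration, not MEM itself.
import torch

def error_minimizing_noise(model, loss_fn, images, texts,
                           steps=10, eps=8 / 255, alpha=2 / 255):
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(images + delta, texts))
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()      # descend: make the loss small
            delta.clamp_(-eps, eps)           # keep the perturbation imperceptible
    return delta.detach()
```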
arXiv Detail & Related papers (2024-07-23T09:00:52Z) - Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
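A hedged sketch of such fine-tuning with the sentence-transformers API is shown below (argument names can vary across library versions). BatchAllTripletLoss mines all valid triplets within a batch of labelled sentences; SoftmaxLoss, the other loss the study highlights, usually expects sentence pairs and is therefore not constructed here. The checkpoint and the toy examples are assumptions.

```python
# Hedged sketch of fine-tuning SBERT on a labelled text-stream chunk with the
# sentence-transformers API (not the study's code; versions may differ).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")       # illustrative checkpoint

# A toy labelled chunk selected by whatever sampling method is being evaluated.
examples = [
    InputExample(texts=["first sampled text"], label=0),
    InputExample(texts=["second sampled text"], label=1),
    InputExample(texts=["third sampled text"], label=0),
]
loader = DataLoader(examples, batch_size=16, shuffle=True)

loss = losses.BatchAllTripletLoss(model=model)        # mines triplets in-batch
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```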
arXiv Detail & Related papers (2024-03-18T23:41:52Z) - Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models? [2.4899077941924967]
This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection.
We provide substantial evidence supporting the hypothesis that only the textual component of hateful memes enables the existing multimodal classifier to generalize across different domains.
Our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders, with an average $\Delta$F1 of 0.18.
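Reading the reported average $\Delta$F1 as the per-setting difference between F1 on text confounders and F1 on image confounders (our interpretation, with made-up numbers), the computation is simply:

```python
# Hedged illustration of the reported metric: Delta-F1 read here as the
# difference between F1 on text confounders and F1 on image confounders,
# averaged over evaluation settings (our reading; scores below are made up).
f1_text_confounders = [0.71, 0.64, 0.69]      # illustrative per-setting scores
f1_image_confounders = [0.52, 0.47, 0.51]

delta_f1 = sum(t - i for t, i in zip(f1_text_confounders, f1_image_confounders))
delta_f1 /= len(f1_text_confounders)
print(round(delta_f1, 2))                     # 0.18 with these toy numbers
```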
arXiv Detail & Related papers (2024-02-07T15:44:55Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.