TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection
- URL: http://arxiv.org/abs/2504.17902v2
- Date: Fri, 07 Nov 2025 18:41:03 GMT
- Title: TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection
- Authors: Girish A. Koushik, Helen Treharne, Aditya Joshi, Diptesh Kanojia
- Abstract summary: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. To tackle these challenges, we introduce TRACE, a hierarchical multimodal framework that leverages visually grounded context augmentation. Our framework achieves state-of-the-art accuracy (0.807) and F1-score (0.806) on the widely-used Hateful Memes dataset.
- Score: 15.240092636523277
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. To tackle these challenges, we introduce TRACE, a hierarchical multimodal framework that leverages visually grounded context augmentation, along with a novel caption-scoring network to emphasize hate-relevant content, and parameter-efficient fine-tuning of CLIP's text encoder. Our experiments demonstrate that selectively fine-tuning deeper text encoder layers significantly enhances performance compared to simpler projection-layer fine-tuning methods. Specifically, our framework achieves state-of-the-art accuracy (0.807) and F1-score (0.806) on the widely-used Hateful Memes dataset, matching the performance of considerably larger models while maintaining efficiency. Moreover, it achieves superior generalization on the MultiOFF offensive meme dataset (F1-score 0.673), highlighting robustness across meme categories. Additional analyses confirm that robust visual grounding and nuanced text representations significantly reduce errors caused by benign confounders. We publicly release our code to facilitate future research.
Related papers
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering [76.53315206999231]
TextPecker is a plug-and-play structural anomaly perceptive RL strategy. It mitigates noisy reward signals and works with any text-to-image generator. It yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering.
arXiv Detail & Related papers (2026-02-24T13:40:23Z) - Labels or Input? Rethinking Augmentation in Multimodal Hate Detection [9.166963162285064]
We present a dual-pronged approach to improve multimodal hate detection. First, we propose a prompt optimization framework that systematically varies prompt structure, supervision, and training modality. Second, we introduce a multimodal data augmentation pipeline that generates 2,479 counterfactually neutral memes.
arXiv Detail & Related papers (2025-08-15T21:31:00Z) - ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification [2.012425476229879]
Traditional approaches fall short in addressing the fine-grained cross-modal contextual consistency problem. We propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models. Our model is uniquely enhanced through reinforced or adversarial learning paradigms.
arXiv Detail & Related papers (2025-08-08T18:10:24Z) - Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration [8.192590936983347]
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding. They are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations. This paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to align text generation with visual evidence at inference time.
arXiv Detail & Related papers (2025-06-26T17:35:40Z) - Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning [26.546646866501735]
We introduce U-CoT+, a novel framework for harmful meme detection. We first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content.
arXiv Detail & Related papers (2025-06-10T06:10:45Z) - Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT.
arXiv Detail & Related papers (2025-05-26T08:54:14Z) - Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation [52.51005875755718]
We focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences.
arXiv Detail & Related papers (2025-05-21T05:29:55Z) - Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering [0.5587293092389789]
Hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. We propose a multimodal hate detection framework that integrates OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
arXiv Detail & Related papers (2025-04-23T13:52:14Z) - TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z) - Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes [8.42736066868944]
We propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes.
Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG) to be infused within a compact VLM framework.
Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-19T02:39:28Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification [11.270267165348626]
We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement.
We propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model.
arXiv Detail & Related papers (2024-09-23T04:49:08Z) - HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes [8.97062933976566]
HateSieve is a framework designed to enhance the detection and segmentation of hateful elements in memes.
HateSieve features a novel Contrastive Meme Generator that creates semantically paired memes.
Empirical experiments on the Hateful Meme show that HateSieve not only surpasses existing LMMs in performance with fewer trainable parameters but also offers a robust mechanism for precisely identifying and isolating hateful content.
arXiv Detail & Related papers (2024-08-11T14:56:06Z) - Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z) - Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z) - Text or Image? What is More Important in Cross-Domain Generalization
Capabilities of Hate Meme Detection Models? [2.4899077941924967]
This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection.
We provide enough pieces of evidence supporting the hypothesis that only the textual component of hateful memes enables the existing multimodal classifier to generalize across different domains.
Our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with an average $\Delta$F1 of 0.18.
arXiv Detail & Related papers (2024-02-07T15:44:55Z) - IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition [5.525052547053668]
Scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. We propose an alternative solution that uses a parallel and iterative decoder that adopts an easy-first decoding strategy.
arXiv Detail & Related papers (2023-12-19T08:03:19Z) - Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.