Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
- URL: http://arxiv.org/abs/2504.16723v1
- Date: Wed, 23 Apr 2025 13:52:14 GMT
- Title: Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
- Authors: Ali Anaissi, Junaid Akram, Kunal Chaturvedi, Ali Braytee
- Abstract summary: Hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. We propose a multimodal hate detection framework that integrates OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
- Score: 0.5587293092389789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Memes are widely used for humor and cultural commentary, but they are increasingly exploited to spread hateful content. Due to their multimodal nature, hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. To address these challenges, we propose a multimodal hate detection framework that integrates key components: OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization of hateful content, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. This enables the framework to uncover latent signals that simpler pipelines fail to detect. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
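The abstract describes a staged pipeline rather than a single end-to-end model. The sketch below is a minimal, hypothetical orchestration of those stages, assuming each component (OCR engine, captioner, retriever, VQA model, sub-label classifier) is injected as a callable; the stage names, probe questions, and 0.5 threshold are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of the staged pipeline described in the abstract.
# Every component is injected as a callable so real models (an OCR engine,
# an image captioner, a RAG retriever, a VQA model, a sub-label classifier)
# can be swapped in; the stand-ins at the bottom are placeholders only.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemeAnalysis:
    ocr_text: str = ""
    caption: str = ""
    retrieved_context: List[str] = field(default_factory=list)
    vqa_findings: List[str] = field(default_factory=list)
    sub_labels: List[str] = field(default_factory=list)
    score: float = 0.0
    hateful: bool = False


@dataclass
class HatefulMemePipeline:
    ocr: Callable[[bytes], str]                             # embedded-text extraction
    captioner: Callable[[bytes], str]                       # neutral visual description
    retriever: Callable[[str], List[str]]                   # RAG-style context lookup
    vqa: Callable[[bytes, str], str]                        # (image, question) -> answer
    classifier: Callable[[Dict], Tuple[float, List[str]]]   # fused features -> (score, sub-labels)
    probe_questions: List[str] = field(default_factory=lambda: [
        "What symbols, gestures, or figures appear in the image?",
        "Which person or group does the embedded text refer to?",
        "Does the image reinforce or contradict the embedded text?",
    ])

    def analyze(self, image: bytes, threshold: float = 0.5) -> MemeAnalysis:
        out = MemeAnalysis()
        out.ocr_text = self.ocr(image)                                  # 1. OCR
        out.caption = self.captioner(image)                             # 2. captioning
        out.retrieved_context = self.retriever(                         # 3. retrieval
            f"{out.ocr_text} {out.caption}")
        for q in self.probe_questions:                                  # 4. iterative VQA
            out.vqa_findings.append(f"{q} -> {self.vqa(image, q)}")
        out.score, out.sub_labels = self.classifier({                   # 5. sub-label classification
            "text": out.ocr_text,
            "caption": out.caption,
            "context": out.retrieved_context,
            "vqa": out.vqa_findings,
        })
        out.hateful = out.score >= threshold
        return out


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model downloads.
    pipeline = HatefulMemePipeline(
        ocr=lambda img: "example embedded text",
        captioner=lambda img: "a person holding a sign at a rally",
        retriever=lambda query: ["background note on the referenced symbol"],
        vqa=lambda img, q: "unclear",
        classifier=lambda features: (0.12, []),
    )
    print(pipeline.analyze(b"raw-image-bytes"))
```

In practice each lambda would wrap a real model; the orchestration above only mirrors the ordering of stages named in the abstract.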
Related papers
- CAMU: Context Augmentation for Meme Understanding [9.49890289676001]
Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages.
We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions.
Our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, on par with the existing SoTA framework.
arXiv Detail & Related papers (2025-04-24T19:27:55Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities. Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z) - RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding. We introduce a referring remote sensing image segmentation foundational model, RSRefSeg. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z) - Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM). AGA achieves new state-of-the-art results, with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z) - HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes [8.97062933976566]
HateSieve is a framework designed to enhance the detection and segmentation of hateful elements in memes.
HateSieve features a novel Contrastive Meme Generator that creates semantically paired memes.
Empirical experiments on the Hateful Memes dataset show that HateSieve not only surpasses existing LMMs in performance with fewer trainable parameters but also offers a robust mechanism for precisely identifying and isolating hateful content.
arXiv Detail & Related papers (2024-08-11T14:56:06Z) - Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection [4.997673761305336]
This paper proposes a context-aware attention framework for multimodal hateful content detection.
We evaluate the proposed approach on two benchmark hateful meme datasets, namely MUTE (Bengali code-mixed) and MultiOFF (English).
arXiv Detail & Related papers (2024-02-15T06:34:15Z) - Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges of the vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and sparse space issue.
To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
arXiv Detail & Related papers (2023-10-04T13:44:56Z) - Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50 (a toy sketch of such a consensus objective appears after this list).
arXiv Detail & Related papers (2023-06-03T11:50:44Z) - PV2TEA: Patching Visual Modality to Textual-Established Information
Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z) - Hateful Memes Detection via Complementary Visual and Linguistic Networks [4.229588654547344]
We investigate a solution based on complementary visual and linguistic networks (CVL) in the Hateful Memes Challenge 2020.
Both contextual-level and sensitive object-level information are considered in visual and linguistic embedding.
Results show that CVL provides decent performance, achieving 78.48% AUROC and 72.95% accuracy.
arXiv Detail & Related papers (2020-12-09T11:11:09Z)
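Relating to the Css-Net entry above: the snippet below is a toy illustration, under stated assumptions, of a consensus-style objective in which two compositors score the same candidate set and a symmetric KL term nudges their score distributions toward agreement. The two-compositor setup, random tensors, target indices, and 0.1 weight are illustrative choices, not the paper's actual formulation (which uses four compositors).

```python
# Toy consensus objective: two "compositors" score a batch of composed
# queries against a shared candidate set; a symmetric KL term encourages
# their softmax distributions over candidates to agree. Tensors are random
# placeholders and the loss weighting is arbitrary.
import torch
import torch.nn.functional as F


def consensus_kl(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two compositors' candidate distributions."""
    log_p_a = F.log_softmax(scores_a, dim=-1)
    log_p_b = F.log_softmax(scores_b, dim=-1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


batch, num_candidates = 4, 16
scores_a = torch.randn(batch, num_candidates, requires_grad=True)  # compositor A similarities
scores_b = torch.randn(batch, num_candidates, requires_grad=True)  # compositor B similarities
targets = torch.zeros(batch, dtype=torch.long)                     # assume candidate 0 is correct

retrieval_loss = F.cross_entropy(scores_a, targets) + F.cross_entropy(scores_b, targets)
total_loss = retrieval_loss + 0.1 * consensus_kl(scores_a, scores_b)
total_loss.backward()
print(float(total_loss))
```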
This list is automatically generated from the titles and abstracts of the papers on this site.