Hate-CLIPper: Multimodal Hateful Meme Classification based on
Cross-modal Interaction of CLIP Features
- URL: http://arxiv.org/abs/2210.05916v2
- Date: Thu, 13 Oct 2022 07:20:23 GMT
- Authors: Gokul Karthik Kumar, Karthik Nandakumar
- Abstract summary: Hateful memes are a growing menace on social media.
Detecting hateful memes requires careful consideration of both visual and textual information.
We propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations.
- Score: 5.443781798915199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hateful memes are a growing menace on social media. While the image and its
corresponding text in a meme are related, they do not necessarily convey the
same meaning when viewed individually. Hence, detecting hateful memes requires
careful consideration of both visual and textual information. Multimodal
pre-training can be beneficial for this task because it effectively captures
the relationship between the image and the text by representing them in a
similar feature space. Furthermore, it is essential to model the interactions
between the image and text features through intermediate fusion. Most existing
methods either employ multimodal pre-training or intermediate fusion, but not
both. In this work, we propose the Hate-CLIPper architecture, which explicitly
models the cross-modal interactions between the image and text representations
obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a
feature interaction matrix (FIM). A simple classifier based on the FIM
representation is able to achieve state-of-the-art performance on the Hateful
Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the
human performance of 82.65. Experiments on other meme datasets such as
Propaganda Memes and TamilMemes also demonstrate the generalizability of the
proposed approach. Finally, we analyze the interpretability of the FIM
representation and show that cross-modal interactions can indeed facilitate the
learning of meaningful concepts. The code for this work is available at
https://github.com/gokulkarthik/hateclipper.
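The abstract's description of the feature interaction matrix (FIM) can be read as follows: project the CLIP image and text embeddings into a common space, take their pairwise interactions (an outer product), and feed the flattened matrix to a simple classifier. The sketch below illustrates that reading in PyTorch; the projection size, hidden width, and the `FIMClassifier` name are illustrative assumptions rather than the authors' exact configuration (see the linked repository for that).

```python
# Minimal sketch of the FIM idea from the abstract, assuming the image and
# text embeddings come from frozen CLIP encoders and are first projected to a
# shared space. Dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class FIMClassifier(nn.Module):
    def __init__(self, clip_dim: int = 512, proj_dim: int = 64, num_classes: int = 2):
        super().__init__()
        # Project CLIP image/text embeddings to a smaller shared space so the
        # outer-product FIM (proj_dim x proj_dim) stays manageable.
        self.img_proj = nn.Linear(clip_dim, proj_dim)
        self.txt_proj = nn.Linear(clip_dim, proj_dim)
        # "Simple classifier" operating on the flattened FIM representation.
        self.classifier = nn.Sequential(
            nn.Linear(proj_dim * proj_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        p_i = self.img_proj(img_emb)  # (B, proj_dim)
        p_t = self.txt_proj(txt_emb)  # (B, proj_dim)
        # Feature interaction matrix: outer product of the two projections,
        # capturing every pairwise image-text feature interaction.
        fim = torch.einsum("bi,bj->bij", p_i, p_t)  # (B, proj_dim, proj_dim)
        return self.classifier(fim.flatten(start_dim=1))  # (B, num_classes)


# Usage with pre-extracted CLIP embeddings (random stand-ins here):
model = FIMClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```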
Related papers
- XMeCap: Meme Caption Generation with Sub-Image Adaptability [53.2509590113364]
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines.
We introduce the XMeCap framework, which adopts supervised fine-tuning and reinforcement learning.
XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively.
arXiv Detail & Related papers (2024-07-24T10:51:46Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces the multi-granularity cross-modality representation learning.
Experiments show that our proposed approach can achieve the SOTA or approximate SOTA performance on two benchmark datasets of tweets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design [66.68194916359309]
Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain.
MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information.
ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image.
arXiv Detail & Related papers (2022-08-11T03:44:02Z)
- MemeTector: Enforcing deep focus for meme detection [8.794414326545697]
It is important to accurately retrieve image memes from social media to better capture the cultural and social aspects of online phenomena.
We propose a methodology that utilizes the visual part of image memes as instances of the regular image class and the initial image memes as instances of the meme class.
We employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts.
arXiv Detail & Related papers (2022-05-26T10:50:29Z)
- Vision-Language Pre-Training with Triple Contrastive Learning [45.80365827890119]
We propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision.
Ours is the first work that takes into account local structure information for multi-modality representation learning.
arXiv Detail & Related papers (2022-02-21T17:54:57Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
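FILIP's "cross-modal late interaction mechanism" is commonly described as token-wise matching: every image patch token is compared with every text token, each token keeps its best match, and the matches are averaged into a single image-text similarity. The function below is a minimal sketch under that assumption; the name `late_interaction_similarity`, the symmetric averaging, and the token shapes are illustrative, not FILIP's exact formulation.

```python
# Minimal sketch of a token-wise late-interaction similarity of the kind FILIP
# describes. Shapes and the averaging scheme are assumptions for illustration.
import torch
import torch.nn.functional as F


def late_interaction_similarity(img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """img_tokens: (B, N_img, D), txt_tokens: (B, N_txt, D) -> per-pair similarity (B,)."""
    img_tokens = F.normalize(img_tokens, dim=-1)
    txt_tokens = F.normalize(txt_tokens, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", img_tokens, txt_tokens)  # token-to-token cosine similarities
    img_to_txt = sim.max(dim=2).values.mean(dim=1)  # each image token's best text match, averaged
    txt_to_img = sim.max(dim=1).values.mean(dim=1)  # each text token's best image match, averaged
    return 0.5 * (img_to_txt + txt_to_img)


# Usage with random stand-in token embeddings:
print(late_interaction_similarity(torch.randn(2, 49, 512), torch.randn(2, 16, 512)).shape)  # torch.Size([2])
```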