MemeFier: Dual-stage Modality Fusion for Image Meme Classification
- URL: http://arxiv.org/abs/2304.02906v2
- Date: Fri, 7 Apr 2023 06:57:42 GMT
- Title: MemeFier: Dual-stage Modality Fusion for Image Meme Classification
- Authors: Christos Koutlis, Manos Schinas, Symeon Papadopoulos
- Abstract summary: New forms of digital content, such as image memes, have given rise to the spread of hate through multimodal means.
We propose MemeFier, a deep learning-based architecture for fine-grained classification of Internet image memes.
- Score: 8.794414326545697
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Hate speech is a societal problem that has grown significantly through the
Internet. New forms of digital content, such as image memes, have given rise to the
spread of hate through multimodal means, which is far more difficult to analyse and
detect than the unimodal case. Accurate automatic processing, analysis and
understanding of this kind of content will facilitate efforts to hinder the
proliferation of hate speech in the digital world. To this end, we
propose MemeFier, a deep learning-based architecture for fine-grained
classification of Internet image memes, utilizing a dual-stage modality fusion
module. The first fusion stage produces feature vectors containing modality
alignment information that captures non-trivial connections between the text
and image of a meme. The second fusion stage leverages the power of a
Transformer encoder to learn inter-modality correlations at the token level and
yield an informative representation. Additionally, we consider external
knowledge as an additional input, and background image caption supervision as a
regularizing component. Extensive experiments on three widely adopted
benchmarks, i.e., Facebook Hateful Memes, Memotion7k and MultiOFF, indicate
that our approach competes with, and in some cases surpasses, the state of the art. Our
code is available at https://github.com/ckoutlis/memefier.
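As a reading aid, the dual-stage fusion described above can be rendered as a minimal PyTorch sketch. This is not the authors' implementation (see the repository linked above): it assumes CLIP-style image and text token features, approximates the first stage with an element-wise alignment vector between pooled modality features, and uses a small Transformer encoder for the second, token-level fusion stage. The class name DualStageFusion and all dimensions are hypothetical.

```python
# Minimal sketch of a dual-stage modality fusion module, loosely following the
# abstract's description. This is NOT the official MemeFier code (see the repo
# linked above); encoder choice, dimensions and heads are illustrative only.
import torch
import torch.nn as nn


class DualStageFusion(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, d_model=256, num_classes=2):
        super().__init__()
        # Project both modalities into a shared space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Stage 2: Transformer encoder over the token-level sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, img_dim), txt_tokens: (B, Nt, txt_dim)
        img = self.img_proj(img_tokens)
        txt = self.txt_proj(txt_tokens)
        # Stage 1: an alignment feature capturing image-text correspondence
        # (element-wise product of pooled modality vectors).
        align = img.mean(dim=1) * txt.mean(dim=1)          # (B, d_model)
        # Stage 2: token-level fusion with a Transformer encoder.
        cls = self.cls.expand(img.size(0), -1, -1)
        seq = torch.cat([cls, align.unsqueeze(1), img, txt], dim=1)
        fused = self.encoder(seq)[:, 0]                    # CLS representation
        return self.head(fused)
```

Under these assumptions, `logits = DualStageFusion()(torch.randn(8, 50, 768), torch.randn(8, 32, 512))` yields an (8, 2) tensor; the external-knowledge input and caption-supervision loss mentioned in the abstract are omitted here for brevity.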
Related papers
- XMeCap: Meme Caption Generation with Sub-Image Adaptability [53.2509590113364]
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines.
We introduce the XMeCap framework, which adopts supervised fine-tuning and reinforcement learning.
XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively.
arXiv Detail & Related papers (2024-07-24T10:51:46Z)
- Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models? [2.4899077941924967]
This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection.
We provide substantial evidence supporting the hypothesis that only the textual component of hateful memes enables existing multimodal classifiers to generalize across different domains.
Our evaluation on a newly created confounder dataset reveals higher performance on text confounders than on image confounders, with an average $\Delta$F1 of 0.18.
arXiv Detail & Related papers (2024-02-07T15:44:55Z)
- Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations [48.82168723932981]
We introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation from code-mixed cyberbullying memes.
A Contrastive Language-Image Pretraining (CLIP) approach has been proposed for visual and textual explanation of a meme.
arXiv Detail & Related papers (2024-01-18T11:24:30Z)
- Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
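The frozen-encoder idea lends itself to a short sketch: modality-specific tokenizers are trained to map each input into a common token space, the shared Transformer encoder stays frozen, and only lightweight heads are learned. The snippet below is an illustrative sketch under those assumptions, not the Meta-Transformer code; the tokenizer dimensions and the FrozenSharedEncoder name are hypothetical.

```python
# Illustrative sketch of a frozen shared encoder for multimodal perception
# (not the Meta-Transformer implementation): per-modality tokenizers are
# trainable, the shared encoder's parameters are frozen, and a small
# task-specific head is trained.
import torch
import torch.nn as nn


class FrozenSharedEncoder(nn.Module):
    def __init__(self, d_model=256, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.encoder.parameters():           # freeze the shared encoder
            p.requires_grad = False
        # Hypothetical tokenizers: project per-modality features to shared tokens.
        self.image_tokenizer = nn.Linear(768, d_model)
        self.text_tokenizer = nn.Linear(512, d_model)
        self.head = nn.Linear(d_model, num_classes)   # trainable task head

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, Ni, 768), txt_feats: (B, Nt, 512)
        tokens = torch.cat(
            [self.image_tokenizer(img_feats), self.text_tokenizer(txt_feats)], dim=1
        )
        pooled = self.encoder(tokens).mean(dim=1)     # mean-pool token outputs
        return self.head(pooled)
```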
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
- MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization [31.209594252045566]
We propose a novel task, MEMEX: given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme.
To benchmark MCC, we propose MIME, a multimodal neural framework that uses common sense enriched meme representation and a layered approach to capture the cross-modal semantic dependencies between the meme and the context.
arXiv Detail & Related papers (2023-05-25T10:19:35Z)
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
- Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features [5.443781798915199]
Hateful memes are a growing menace on social media.
Detecting hateful memes requires careful consideration of both visual and textual information.
We propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations.
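One way to make such cross-modal interaction explicit, in the spirit of a feature interaction matrix over CLIP embeddings, is to take an outer product of projected image and text vectors and classify the flattened result. The sketch below is illustrative only and assumes global CLIP-like embeddings; the projection size and classifier are assumptions, not taken from the paper.

```python
# Illustrative sketch of explicit cross-modal interaction between image and
# text embeddings (e.g., from CLIP) via an outer-product interaction matrix;
# dimensions and the classifier are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    def __init__(self, clip_dim=512, proj_dim=64, num_classes=2):
        super().__init__()
        self.img_proj = nn.Linear(clip_dim, proj_dim)
        self.txt_proj = nn.Linear(clip_dim, proj_dim)
        self.classifier = nn.Sequential(
            nn.Linear(proj_dim * proj_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, img_emb, txt_emb):
        # img_emb, txt_emb: (B, clip_dim) global embeddings from a CLIP-like model
        i = self.img_proj(img_emb)                     # (B, proj_dim)
        t = self.txt_proj(txt_emb)                     # (B, proj_dim)
        fim = torch.einsum("bi,bj->bij", i, t)         # pairwise feature interactions
        return self.classifier(fim.flatten(start_dim=1))
```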
arXiv Detail & Related papers (2022-10-12T04:34:54Z)
- Feels Bad Man: Dissecting Automated Hateful Meme Detection Through the Lens of Facebook's Challenge [10.775419935941008]
We assess the efficacy of current state-of-the-art multimodal machine learning models toward hateful meme detection.
We use two benchmark datasets comprising 12,140 and 10,567 images from 4chan's "Politically Incorrect" board (/pol/) and Facebook's Hateful Memes Challenge dataset.
We conduct three experiments to determine the importance of multimodality on classification performance, the influential capacity of fringe Web communities on mainstream social platforms and vice versa.
arXiv Detail & Related papers (2022-02-17T07:52:22Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach the human level of performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- Detecting Hate Speech in Multi-modal Memes [14.036769355498546]
We focus on hate speech detection in multi-modal memes, which pose an interesting multi-modal fusion problem.
We aim to solve the Facebook Meme Challenge (Kiela et al., 2020), a binary classification task of predicting whether a meme is hateful or not.
arXiv Detail & Related papers (2020-12-29T18:30:00Z)