MemeLens: Multilingual Multitask VLMs for Memes
- URL: http://arxiv.org/abs/2601.12539v1
- Date: Sun, 18 Jan 2026 19:01:03 GMT
- Title: MemeLens: Multilingual Multitask VLMs for Memes
- Authors: Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Abul Hasnat, Dimitar Dimitrov, Giovanni Da San Martino, Preslav Nakov, Firoj Alam
- Abstract summary: We propose MemeLens, a unified multilingual and explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting.
- Score: 45.8232386994625
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap, we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available to the community.
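The consolidation step described in the abstract (filtering datasets and mapping their labels into a shared 20-task taxonomy) can be pictured with a minimal sketch. Everything below is illustrative: the dataset names, task names, and label strings are assumptions, not the paper's actual mapping.

```python
# Illustrative sketch of mapping dataset-specific labels into a shared
# taxonomy. All dataset names, task names, and label strings below are
# assumptions for illustration, not the paper's actual mapping.
from dataclasses import dataclass
from typing import Optional

# Hypothetical shared taxonomy: unified task name -> allowed unified labels.
TAXONOMY = {
    "harm": ["harmful", "not_harmful"],
    "misogyny": ["misogynous", "not_misogynous"],
    "sentiment": ["positive", "neutral", "negative"],
}

# Hypothetical per-dataset mapping: (dataset, original label) -> (task, unified label).
LABEL_MAP = {
    ("hateful_memes", "hateful"): ("harm", "harmful"),
    ("hateful_memes", "not-hateful"): ("harm", "not_harmful"),
    ("mami", "misogynous"): ("misogyny", "misogynous"),
    ("memotion", "positive"): ("sentiment", "positive"),
}

@dataclass
class Example:
    dataset: str
    image_path: str
    ocr_text: str
    original_label: str

def to_unified(ex: Example) -> Optional[dict]:
    """Map one example into the shared taxonomy; drop it if no mapping exists."""
    key = (ex.dataset, ex.original_label)
    if key not in LABEL_MAP:
        return None  # mirrors the filtering step: unmapped labels are discarded
    task, label = LABEL_MAP[key]
    assert label in TAXONOMY[task], "mapping must respect the taxonomy"
    return {"task": task, "label": label, "image": ex.image_path, "text": ex.ocr_text}

if __name__ == "__main__":
    ex = Example("hateful_memes", "img/0001.png", "top text / bottom text", "hateful")
    print(to_unified(ex))  # {'task': 'harm', 'label': 'harmful', ...}
```

Keeping the mapping as plain data rather than code makes it easy to audit which original labels feed each unified task.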
Related papers
- Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes [5.243460995467895]
This study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels.
arXiv Detail & Related papers (2025-01-23T17:18:30Z) - Multilingual Diversity Improves Vision-Language Representations [97.16233528393356]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet. On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
"CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs)
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z) - PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models [7.388466146105024]
We propose PromptMTopic, a novel multimodal prompt-based model to learn topics from both text and visual modalities.
Our model effectively extracts and clusters topics learned from memes, considering the semantic interaction between the text and visual modalities.
Our work contributes to the understanding of the topics and themes of memes, a crucial form of communication in today's society.
arXiv Detail & Related papers (2023-12-11T03:36:50Z) - How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have [58.23138483086277]
In this work, we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that, using already existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z) - SemiMemes: A Semi-supervised Learning Approach for Multimodal Memes Analysis [0.0]
SemiMemes is a novel training method that combines an auto-encoder with a classification task to make use of abundant unlabeled data.
This research proposes a multimodal semi-supervised learning approach that outperforms state-of-the-art multimodal semi-supervised and supervised models.
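A rough PyTorch sketch of how such a joint auto-encoder-plus-classification objective could look is below; the feature dimensions, architecture, and loss weighting are assumptions for illustration, not the SemiMemes implementation.

```python
# Illustrative semi-supervised objective: reconstruction loss on all memes
# (labeled and unlabeled) plus a classification loss on the labeled subset.
# Dimensions, architecture, and the 1.0 loss weight are assumptions.
import torch
import torch.nn as nn

class AutoEncoderClassifier(nn.Module):
    def __init__(self, in_dim=512, hidden=128, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, in_dim)       # reconstruction head
        self.classifier = nn.Linear(hidden, num_classes)  # task head

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = AutoEncoderClassifier()
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

labeled = torch.randn(8, 512)     # placeholder fused image+text features
labels = torch.randint(0, 2, (8,))
unlabeled = torch.randn(32, 512)  # the "resourceful" unlabeled pool

recon_u, _ = model(unlabeled)
recon_l, logits = model(labeled)
# Reconstruction uses all data; classification only the labeled subset.
loss = mse(recon_u, unlabeled) + mse(recon_l, labeled) + 1.0 * ce(logits, labels)
loss.backward()
```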
arXiv Detail & Related papers (2023-03-31T11:22:03Z) - Multimodal and Explainable Internet Meme Classification [3.4690152926833315]
We design and implement a modular and explainable architecture for Internet meme understanding.
We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification.
We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme.
arXiv Detail & Related papers (2022-12-11T21:52:21Z) - What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes [42.357272117919464]
We introduce EXCLAIM, a novel task of generating explanations for visual semantic role labeling in memes.
To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities.
We also posit LUMEN, a novel multimodal, multi-task learning framework that endeavors to address EXCLAIM optimally.
arXiv Detail & Related papers (2022-12-01T18:21:36Z) - Detecting and Understanding Harmful Memes: A Survey [48.135415967633676]
We offer a comprehensive survey with a focus on harmful memes.
One interesting finding is that many types of harmful memes remain understudied, e.g., those featuring self-harm and extremism.
Another observation is that memes can propagate globally through repackaging in different languages and that they can also be multilingual.
arXiv Detail & Related papers (2022-05-09T13:43:27Z) - Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
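A rough sketch of the auxiliary-sentence idea summarized above is given below: object tags (hard-coded here as stand-ins for a detector's output) are turned into a sentence and paired with the tweet text for a BERT sentence-pair classifier. The paper's object-aware transformer and training procedure are omitted, and all names and strings are assumptions.

```python
# Illustrative sketch: build an "auxiliary sentence" from object tags and
# pair it with the tweet text for a BERT sentence-pair classifier.
# The tags and label count are assumptions; the classification head is untrained.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

tweet = "Lovely weather, but the service at the cafe was awful"
object_tags = ["cafe", "table", "coffee cup"]  # assumed detector output
aux_sentence = "The image contains " + ", ".join(object_tags) + "."

# BERT sentence-pair encoding: [CLS] tweet [SEP] auxiliary sentence [SEP]
inputs = tokenizer(tweet, aux_sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))  # head is untrained here, so these scores are meaningless
```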
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.