XMeCap: Meme Caption Generation with Sub-Image Adaptability
- URL: http://arxiv.org/abs/2407.17152v2
- Date: Wed, 31 Jul 2024 12:56:22 GMT
- Title: XMeCap: Meme Caption Generation with Sub-Image Adaptability
- Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao,
- Abstract summary: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines.
We introduce the XMeCap framework, which adopts supervised fine-tuning and reinforcement learning.
XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively.
- Score: 53.2509590113364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multi-images on meme captioning. We then introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71\% and 4.82\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
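As a rough illustration of the kind of reward the abstract describes, the minimal sketch below blends a global image-caption similarity with the average of per-sub-image ("local") similarities, approximated here with off-the-shelf CLIP embeddings. The model checkpoint, the helper names, and the weighting `alpha` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical reward sketch: global + local visual-text similarity via CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def caption_reward(full_meme: Image.Image, sub_images: list, caption: str, alpha: float = 0.5) -> float:
    """Blend the whole-meme similarity with the mean per-sub-image similarity.
    alpha is a guessed weighting, not a value from the paper."""
    global_sim = clip_similarity(full_meme, caption)
    local_sims = [clip_similarity(s, caption) for s in sub_images] or [global_sim]
    return alpha * global_sim + (1 - alpha) * sum(local_sims) / len(local_sims)
```

Such a scalar could then serve as the reinforcement-learning reward for a caption generator, rewarding captions that fit both the meme as a whole and each of its sub-images.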
Related papers
- Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models? [2.4899077941924967]
This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection.
We provide substantial evidence supporting the hypothesis that only the textual component of hateful memes enables existing multimodal classifiers to generalize across different domains.
Our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with an average $\Delta$F1 of 0.18.
arXiv Detail & Related papers (2024-02-07T15:44:55Z)
- Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations [48.82168723932981]
We introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation from code-mixed cyberbullying memes.
A Contrastive Language-Image Pretraining (CLIP) approach has been proposed for visual and textual explanation of a meme.
arXiv Detail & Related papers (2024-01-18T11:24:30Z)
- PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models [7.388466146105024]
We propose PromptMTopic, a novel multimodal prompt-based model to learn topics from both text and visual modalities.
Our model effectively extracts and clusters topics learned from memes, considering the semantic interaction between the text and visual modalities.
Our work contributes to the understanding of the topics and themes of memes, a crucial form of communication in today's society.
arXiv Detail & Related papers (2023-12-11T03:36:50Z)
- Social Meme-ing: Measuring Linguistic Variation in Memes [24.226580919186613]
We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables.
We make available the resulting SemanticMemes dataset of 3.8M images clustered by their semantic function.
We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language.
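As a loose, hypothetical illustration of the template-clustering step mentioned above, one could group visually similar meme instances by clustering their image embeddings; the embedding source, cluster count, and function name below are assumptions, not the SemanticMemes pipeline.

```python
# Minimal sketch: assign meme image embeddings to template clusters.
import numpy as np
from sklearn.cluster import KMeans

def cluster_memes(embeddings: np.ndarray, n_templates: int = 50) -> np.ndarray:
    """Group meme embeddings (n_memes, dim) into n_templates clusters;
    each returned label stands in for a candidate meme template."""
    return KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit_predict(embeddings)
```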
arXiv Detail & Related papers (2023-11-15T17:20:20Z)
- What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes [42.357272117919464]
We introduce a novel task - EXCLAIM, generating explanations for visual semantic role labeling in memes.
To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities.
We also posit LUMEN, a novel multimodal, multi-task learning framework that endeavors to address EXCLAIM optimally.
arXiv Detail & Related papers (2022-12-01T18:21:36Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Do Images really do the Talking? Analysing the significance of Images in Tamil Troll meme classification [0.16863755729554888]
We explore the significance of visual features of images in classifying memes.
We classify memes as troll or non-troll based on their images and the text on them.
arXiv Detail & Related papers (2021-08-09T09:04:42Z)
- Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset [47.65948529524281]
We collect hateful and non-hateful memes from Pinterest to evaluate out-of-sample performance on models pre-trained on the Facebook dataset.
We find that memes in the wild differ in two key aspects: 1) captions must be extracted via OCR, and 2) they are more diverse than traditional memes, including screenshots of conversations or text on a plain background.
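A minimal sketch of the OCR step mentioned above, assuming the Tesseract engine and the pytesseract wrapper are available; the function name and single-line collapsing are illustrative choices, not the paper's pipeline.

```python
# Extract the overlaid caption text from a meme image via OCR.
from PIL import Image
import pytesseract

def extract_caption(path: str) -> str:
    """Return the text Tesseract finds in a meme image, collapsed to one line."""
    text = pytesseract.image_to_string(Image.open(path))
    return " ".join(text.split())

# e.g. caption = extract_caption("meme.png")
```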
arXiv Detail & Related papers (2021-07-09T09:04:05Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
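The sketch below gives a rough sense of cross-media multi-head attention in the spirit of M3H-Att: text tokens attend over image-region features. The module name, dimensions, and residual fusion are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical cross-modal attention block: text queries attend to image regions.
import torch
import torch.nn as nn

class CrossMediaAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, dim); image_feats: (batch, n_regions, dim)
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + attended)  # residual fusion of the two modalities

# e.g. fused = CrossMediaAttention()(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
```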
arXiv Detail & Related papers (2020-11-03T08:44:18Z)