From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme
- URL: http://arxiv.org/abs/2512.24555v1
- Date: Wed, 31 Dec 2025 01:35:49 GMT
- Title: From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme
- Authors: Xueyan Li, Yingyi Xue, Mengjie Jiang, Qingzi Zhu, Yazhe Niu,
- Abstract summary: We propose HUMOR, a novel framework that guides memes through hierarchical reasoning and aligns them with group-wise human preferences.<n>To capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template.<n>Our work presents a general training paradigm for open-ended, human-aligned multimodal generation.
- Score: 5.462301274468853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision. It requires a nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR}, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, which traces back from ground-truth captions, enhances reasoning diversity. We further analyze that this multi-path exploration with anchoring maintains a high expected humor quality, under the practical condition that high-quality paths retain significant probability mass. Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template. Following established theory, this approach ensures a consistent and robust proxy for human preference, even with subjective and noisy labels. The reward model then enables a group-wise reinforcement learning optimization, guaranteeing providing a theoretical guarantee for monotonic improvement within the trust region. Extensive experiments show that HUMOR empowers various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality. Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output group.
Related papers
- Can Thinking Models Think to Detect Hateful Memes? [7.77199523320035]
Thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding.<n>We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs.<n>Our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent.
arXiv Detail & Related papers (2026-03-01T18:41:52Z) - MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment [48.39756797294967]
We present MM-SCALE, a dataset for aligning Vision-Language Models with human moral preferences.<n>Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans.<n>Our framework provides richer alignment signals and finer calibration of multimodal moral reasoning.
arXiv Detail & Related papers (2026-02-03T15:48:00Z) - Unified Personalized Reward Model for Vision Generation [27.496220369122494]
We propose UnifiedReward-Flex, a unified personalized reward model for vision generation.<n>We first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT.<n>We then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment.
arXiv Detail & Related papers (2026-02-02T17:44:21Z) - From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces.<n>By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process.<n>We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning [76.95203056566191]
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought.<n>We build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement.<n>ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
arXiv Detail & Related papers (2025-10-30T17:51:38Z) - SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator [54.562217603802075]
We introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias.<n>Experiments on class-conditional generation tasks show thatSONA achieves superior sample quality and conditional alignment compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-10-06T08:26:06Z) - MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning [28.841076643572933]
Multi-subject image generation aims to synthesize user-provided subjects in a single image.<n>Existing methods are limited by their reliance on simple reconstruction-based objectives.<n>We propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation.
arXiv Detail & Related papers (2025-09-26T06:41:43Z) - D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning - A Benchmark Dataset and Method [4.561044673225099]
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues.<n>We introduce a novel dataset of 4,379 memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating.<n>We propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model.
arXiv Detail & Related papers (2025-09-08T14:55:16Z) - PRISM: Perspective Reasoning for Integrated Synthesis and Mediation as a Multi-Perspective Framework for AI Alignment [0.0]
Perspective Reasoning for Integrated Synthesis and Mediation (PRISM) is a framework for addressing persistent challenges in AI alignment.<n>PRISM organizes moral concerns into seven "basis worldviews", each hypothesized to capture a distinct dimension of human moral cognition.<n>We briefly outline future directions, including real-world deployments and formal verifications, while maintaining the core focus on multi-perspective synthesis and conflict mediation.
arXiv Detail & Related papers (2025-02-05T02:13:57Z) - Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z) - Rewarded soups: towards Pareto-optimal alignment by interpolating
weights fine-tuned on diverse rewards [101.7246658985579]
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
arXiv Detail & Related papers (2023-06-07T14:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.