REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
- URL: http://arxiv.org/abs/2512.23169v1
- Date: Mon, 29 Dec 2025 03:24:09 GMT
- Title: REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
- Authors: Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan
- Abstract summary: REVEALER is a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments.
- Score: 10.151027538362259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
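For concreteness, the following is a minimal Python sketch of how a composite reward of this kind could be assembled. It is not the paper's implementation: the section tag names, the IoU-based grounding proxy, and the weights are all illustrative assumptions based only on the abstract's description of the three reward terms.

```python
# Illustrative sketch only: tags, grounding proxy, and weights are assumptions,
# not REVEALER's actual reward definition.

def format_reward(response: str) -> float:
    """1.0 if the response contains the assumed grounding/reasoning/conclusion sections."""
    tags = ["<grounding>", "</grounding>", "<reasoning>", "</reasoning>",
            "<conclusion>", "</conclusion>"]
    return 1.0 if all(tag in response for tag in tags) else 0.0

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def composite_reward(response, pred_boxes, gt_boxes, pred_verdict, gt_verdict,
                     w_format=0.2, w_ground=0.4, w_align=0.4) -> float:
    """Weighted sum of format, grounding, and alignment terms (weights are placeholders)."""
    r_format = format_reward(response)
    if pred_boxes and gt_boxes:
        # Grounding proxy: best IoU between each predicted box and any ground-truth box.
        r_ground = sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)
    else:
        r_ground = 0.0
    r_align = 1.0 if pred_verdict == gt_verdict else 0.0
    return w_format * r_format + w_ground * r_ground + w_align * r_align
```

In a GRPO-style setup, a scalar reward like this would be computed for each sampled response and normalized within its group before the policy update.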
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - Unified Personalized Reward Model for Vision Generation [27.496220369122494]
We propose UnifiedReward-Flex, a unified personalized reward model for vision generation. We first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT. We then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment.
arXiv Detail & Related papers (2026-02-02T17:44:21Z) - Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z) - Generative Reasoning Recommendation via LLMs [48.45009951684554]
Large language models (LLMs) face fundamental challenges in functioning as generative reasoning recommendation models (GRRMs). This work explores how to build GRRMs by adapting pre-trained LLMs, which achieves a unified understanding-reasoning-prediction manner for recommendation tasks. We propose GREAM, an end-to-end framework that integrates three components: Collaborative-Semantic Alignment, Reasoning Curriculum Activation, and Sparse-Regularized Group Policy Optimization.
arXiv Detail & Related papers (2025-10-23T17:59:31Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure. We restructure the training data to enforce a new output form, providing new annotations for existing datasets. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z) - LGAI-EMBEDDING-Preview Technical Report [41.68404082385825]
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score.
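As a brief illustration of the Borda-count aggregation mentioned above, here is a short Python sketch; the model names and per-task rankings are invented placeholders, not results from the report.

```python
# Borda-count aggregation across benchmark tasks (hypothetical data).
from collections import defaultdict

def borda_scores(rankings: list[list[str]]) -> dict[str, int]:
    """Each ranking lists models best-to-worst; position i in an n-model
    ranking earns (n - 1 - i) points, summed over all tasks."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, model in enumerate(ranking):
            scores[model] += n - 1 - i
    return dict(scores)

# Hypothetical example: three tasks, three models.
tasks = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]
print(borda_scores(tasks))  # {'model_a': 5, 'model_b': 3, 'model_c': 1}
```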
arXiv Detail & Related papers (2025-06-09T05:30:35Z) - Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models [0.0]
This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models. Our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
arXiv Detail & Related papers (2025-05-06T18:53:34Z) - AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset [89.37514696019484]
Preference learning is critical for aligning large language models with human values. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization.
arXiv Detail & Related papers (2025-04-04T17:33:07Z) - Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z) - An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning.
In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs.
We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z) - Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)