Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
- URL: http://arxiv.org/abs/2507.21391v2
- Date: Wed, 30 Jul 2025 04:49:38 GMT
- Title: Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
- Authors: Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
- Abstract summary: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives. LLaVA-Reward directly utilizes the hidden states of multimodal large language models (MLLMs). We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking.
- Score: 60.164968941945645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing the text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluation and inference-time scaling in text-to-image generation.
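Since the abstract describes scoring from MLLM hidden states, a Skip-connection Cross Attention (SkipCA) module that connects early-layer visual features with later-layer hidden representations, and fine-tuning on paired preference data, a minimal PyTorch sketch of these ideas follows. This is an illustration based only on the abstract, not the authors' implementation: the names `SkipCA` and `RewardHead`, the last-token pooling, the dimensions, and the choice of a Bradley-Terry pairwise loss are all assumptions.

```python
# Hypothetical sketch of the ideas in the abstract (not the authors' code):
# a cross-attention block that lets later-layer hidden states attend to
# early-layer visual features ("SkipCA"), a small head mapping the pooled
# hidden state to a scalar reward, and a Bradley-Terry loss for paired
# preference data. Names, pooling, and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipCA(nn.Module):
    """Skip-connection cross attention: late hidden states (queries) attend
    to early-layer visual features (keys/values), added back residually."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, late_hidden: torch.Tensor, early_visual: torch.Tensor) -> torch.Tensor:
        # late_hidden:  (B, T, D) hidden states from a later MLLM layer
        # early_visual: (B, V, D) visual-token features from an early layer
        attended, _ = self.attn(query=late_hidden, key=early_visual, value=early_visual)
        return self.norm(late_hidden + attended)  # residual "skip" connection


class RewardHead(nn.Module):
    """Maps fused hidden states to a scalar score for one evaluation perspective
    (e.g. alignment, fidelity/artifact, safety, or overall ranking)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.skip_ca = SkipCA(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, late_hidden: torch.Tensor, early_visual: torch.Tensor) -> torch.Tensor:
        fused = self.skip_ca(late_hidden, early_visual)
        pooled = fused[:, -1]                 # assumed last-token pooling
        return self.mlp(pooled).squeeze(-1)   # (B,) scalar rewards


def pairwise_preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss for paired preference data: push the preferred
    image's score above the rejected image's score."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

For unpaired data with absolute labels, the same head could instead be trained with a pointwise regression or binary cross-entropy objective; the abstract only states that both paired and unpaired preference data are supported, so the exact losses above are assumptions.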
Related papers
- Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression [2.9998889086656586]
We show how Regression via Transformer-Based Classification (RvTC) replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding.
arXiv Detail & Related papers (2025-07-20T15:05:24Z) - Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks. We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z) - MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs [23.011836329934255]
Vision Dynamic Embedding-Guided Pretraining (VDEP) is a hybrid autoregressive training paradigm for MLLMs. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP outperforms existing baseline methods.
arXiv Detail & Related papers (2025-02-13T09:04:28Z) - ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Aligned large language models (LLMs) are trained to generate human-preferable texts.
In this paper, we identify common characteristics shared by the generations of these aligned models.
We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z) - LLM Attributor: Interactive Visual Attribution for LLM Generation [29.116016627864095]
LLM Attributor is a Python library that provides interactive visualizations for training data attribution of large language models.
Our library offers a new way to quickly attribute an LLM's text generation to training data points.
arXiv Detail & Related papers (2024-04-01T13:16:34Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.