Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
- URL: http://arxiv.org/abs/2505.20107v1
- Date: Mon, 26 May 2025 15:11:26 GMT
- Title: Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
- Authors: Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Lefei Zhang,
- Abstract summary: We propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models. We first reformulate T2MV denoising across all views as a single unified Markov decision process. We then introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning.
- Score: 36.08180902708641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.
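The abstract's two core ideas, using ZMV-Sampling's reward advantage over standard sampling as the learning signal (MV-ZigAL) and treating the joint-view reward as an explicit constraint (MVC-ZigAL), can be sketched roughly as below. This is a minimal illustration only: the REINFORCE-style surrogate and the Lagrangian dual update are assumptions on my part, and every name (logp_std, r_joint_zmv, joint_threshold, etc.) is an illustrative placeholder rather than the authors' actual implementation.

```python
import torch

def mvc_zigal_update(logp_std, logp_zmv,
                     r_fid_std, r_fid_zmv,      # per-view fidelity rewards
                     r_joint_std, r_joint_zmv,  # joint-view consistency rewards
                     lam, joint_threshold, dual_lr=0.01):
    """Hedged sketch: policy-gradient surrogate built from ZMV-Sampling's
    reward advantage (MV-ZigAL), wrapped in a constrained objective that
    maximizes per-view fidelity subject to a joint-view reward constraint
    (MVC-ZigAL), handled here with a simple Lagrangian penalty."""
    # Zigzag advantage: how much the inversion-denoising pass improves each reward.
    adv_fid = r_fid_zmv - r_fid_std
    adv_joint = r_joint_zmv - r_joint_std

    # Lagrangian combination of the fidelity objective and the joint-view constraint.
    adv = adv_fid + lam * adv_joint

    # REINFORCE-style surrogate (assumed form): weight the log-probability gap
    # between ZMV-Sampling and standard sampling by the detached advantage,
    # pushing the base sampling policy toward ZMV-Sampling's behaviour.
    loss = -(adv.detach() * (logp_zmv - logp_std)).mean()

    # Dual ascent on the multiplier: grow lam when the joint-view constraint
    # is violated, shrink it otherwise, projected to stay non-negative.
    lam_new = max(0.0, lam + dual_lr * (joint_threshold - float(r_joint_std.mean())))
    return loss, lam_new
```

In this reading, the multiplier lam plays the balancing role described in the abstract: it redirects optimization pressure between per-view fidelity and cross-view consistency depending on whether the joint-view constraint is currently satisfied.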
Related papers
- Auto-Regressively Generating Multi-View Consistent Images [10.513203377236744]
We propose the Multi-View Auto-Regressive (MV-AR) method to generate consistent multi-view images from arbitrary prompts. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Experiments demonstrate the performance and versatility of MV-AR, which reliably produces consistent multi-view images.
arXiv Detail & Related papers (2025-06-23T11:28:37Z) - EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution [11.331361804059625]
Enhancing Anything Model (EAM) is a novel Blind Super-Resolution method. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
arXiv Detail & Related papers (2025-05-08T13:03:07Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images [50.36605863731669]
NVComposer is a novel approach that eliminates the need for explicit external alignment. NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases.
arXiv Detail & Related papers (2024-12-04T17:58:03Z) - FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model [76.84519526283083]
We present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
FiTv2 exhibits $2\times$ the convergence speed of FiT when incorporating advanced training-free extrapolation techniques.
Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
arXiv Detail & Related papers (2024-10-17T15:51:49Z) - Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z) - DualKanbaFormer: An Efficient Selective Sparse Framework for Multimodal Aspect-based Sentiment Analysis [0.6187939267100836]
We introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to balance coarse-grained aggregation and fine-grained selection for aspect-focused precision. We replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability.
arXiv Detail & Related papers (2024-08-27T19:33:15Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
We also introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z) - Learning to Rebalance Multi-Modal Optimization by Adaptively Masking Subnetworks [13.065212096469537]
We propose a novel importance sampling-based, element-wise joint optimization method, called Adaptively Mask Subnetworks Considering Modal Significance (AMSS).
Specifically, we incorporate mutual information rates to determine the modal significance and employ non-uniform adaptive sampling to select foreground subnetworks from each modality for parameter updates.
Building upon theoretical insights, we further enhance the multi-modal mask subnetwork strategy using unbiased estimation, referred to as AMSS+.
arXiv Detail & Related papers (2024-04-12T09:22:24Z) - Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs).
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
arXiv Detail & Related papers (2024-02-26T05:08:40Z) - FiT: Flexible Vision Transformer for Diffusion Model [81.85667773832279]
We present a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens.
Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions.
arXiv Detail & Related papers (2024-02-19T18:59:07Z) - ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo [11.41432976633312]
In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet.
To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance.
With the novel design of an asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals.
arXiv Detail & Related papers (2023-08-04T08:16:47Z)