MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences
- URL: http://arxiv.org/abs/2412.06614v1
- Date: Mon, 09 Dec 2024 16:05:31 GMT
- Title: MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences
- Authors: Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang
- Abstract summary: We present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We also propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy.
- Score: 23.367079270965068
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL$\cdot$E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a fairer and more transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences.
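The pipeline described here (16k expert pairwise comparisons distilled into a scalar reward model) is naturally fit with a Bradley-Terry pairwise objective. The sketch below is a minimal illustration of that objective only; the encoder, feature dimension, and data are placeholders, not MVReward's actual architecture.

```python
# Minimal Bradley-Terry reward-model sketch (hypothetical architecture,
# not the actual MVReward implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseRewardModel(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        # Placeholder encoder: in practice a vision backbone would run
        # over the generated multi-view images.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, 1)  # scalar reward

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

model = PairwiseRewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy batch: features of the preferred (winner) and rejected (loser) assets.
winner = torch.randn(8, 768)
loser = torch.randn(8, 768)

# Bradley-Terry loss: maximize P(winner > loser) = sigmoid(r_w - r_l).
opt.zero_grad()
loss = -F.logsigmoid(model(winner) - model(loser)).mean()
loss.backward()
opt.step()
```

Once trained, such a scalar reward can both rank image-driven 3D methods directly and serve as the preference signal inside a tuning strategy like MVP.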
Related papers
- Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding [29.191627597682597]
We present a framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with user preferences.
Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
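One concrete way to close such a human-in-the-loop round, sketched below purely as an assumption (the paper's actual objective may differ), is rejection-sampling fine-tuning: sample several candidates, keep the one the reward model prefers, and fine-tune toward it.

```python
# RAFT-style round: sample candidates, keep the reward-preferred one, then
# fine-tune toward it. All components below are toy stand-ins, not the
# paper's actual generator, reward model, or objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Linear(16, 16)                     # stand-in "generator"
reward_model = lambda x: x.sum()                  # stand-in reward
opt = torch.optim.SGD(generator.parameters(), lr=1e-2)
prompts = [torch.randn(16) for _ in range(4)]     # stand-in prompts

kept = []
with torch.no_grad():                             # step 1: rerank candidates
    for p in prompts:
        cands = [generator(p + 0.1 * torch.randn(16)) for _ in range(4)]
        scores = torch.stack([reward_model(c) for c in cands])
        kept.append((p, cands[int(scores.argmax())]))

for p, target in kept:                            # step 2: imitate the winners
    opt.zero_grad()
    F.mse_loss(generator(p), target).backward()
    opt.step()
```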
arXiv Detail & Related papers (2025-04-25T09:35:02Z)
- MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention [83.56588173102594]
We introduce a solution called mesh attention to enable training at 1024x1024 resolution.
This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency.
Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT.
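Mesh attention, as summarized above, cuts cost by letting each token attend only to cross-view locations that correspond under the mesh. The toy below assumes such correspondences arrive as a precomputed index map; it is a hypothetical sketch, not MEAT's implementation.

```python
# Toy "mesh attention": each query token attends only to K cross-view
# tokens selected by a precomputed correspondence index (e.g., from
# rasterizing a shared mesh). Hypothetical sketch, not MEAT's code.
import torch

B, N, K, D = 2, 1024, 8, 64           # batch, tokens/view, correspondences, dim
q = torch.randn(B, N, D)              # queries from the current view
kv = torch.randn(B, N, D)             # tokens from another view
idx = torch.randint(0, N, (B, N, K))  # assumed mesh-derived correspondences

# Gather the K correspondent keys/values for every query token.
idx_exp = idx.unsqueeze(-1).expand(-1, -1, -1, D)            # (B, N, K, D)
k = torch.gather(kv.unsqueeze(1).expand(-1, N, -1, -1), 2, idx_exp)
v = k  # shared keys/values in this toy

attn = torch.softmax((q.unsqueeze(2) * k).sum(-1) / D ** 0.5, dim=-1)
out = (attn.unsqueeze(-1) * v).sum(2)  # (B, N, D): cost O(N*K), not O(N^2)
```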
arXiv Detail & Related papers (2025-03-11T17:50:59Z)
- MEt3R: Measuring Multi-View Consistency in Generated Images [47.152540564255204]
We introduce MEt3R, a metric for multi-view consistency in generated images.
Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner.
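Given the reconstruction, the metric reduces to comparing one view's features with the other view's features warped into the same frame. The snippet sketches only that final comparison step and assumes an upstream stage (e.g., a DUSt3R-style reconstruction) has already produced the warped features and validity mask.

```python
# Consistency scoring sketch: assumes an upstream reconstruction step has
# already warped view B's features into view A's frame.
import torch
import torch.nn.functional as F

def consistency_score(feat_a, warped_feat_b, valid_mask):
    """feat_a, warped_feat_b: (C, H, W); valid_mask: (H, W) bool.
    Returns mean cosine similarity over pixels with valid correspondences."""
    sim = F.cosine_similarity(feat_a, warped_feat_b, dim=0)  # (H, W)
    return sim[valid_mask].mean()

feat_a = torch.randn(64, 32, 32)
warped_b = torch.randn(64, 32, 32)
mask = torch.rand(32, 32) > 0.2
print(float(consistency_score(feat_a, warped_b, mask)))  # higher = more consistent
```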
arXiv Detail & Related papers (2025-01-10T20:43:33Z)
- MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model [87.71060849866093]
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks.
Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses.
We present several training and model modifications to strengthen the model with scaled-up datasets.
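A common way to condition a diffusion backbone on "variable reference views and camera poses" is a per-pixel ray embedding such as Plücker coordinates; treating this as MVGenMaster's exact parameterization is an assumption, but the construction itself is standard:

```python
# Per-pixel Plücker ray embedding for camera conditioning (a common
# parameterization; assumed for illustration, not confirmed for MVGenMaster).
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """K: (3,3) intrinsics; R, t: world-to-camera rotation/translation.
    Returns (H, W, 6) per-pixel [direction, moment] rays in world space."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T @ R                    # R^T K^-1 pix
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = -R.T @ t                                      # camera center
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)         # (H, W, 6)

K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
emb = plucker_embedding(K, np.eye(3), np.zeros(3), H=64, W=64)
print(emb.shape)  # (64, 64, 6)
```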
arXiv Detail & Related papers (2024-11-25T07:34:23Z)
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
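Both augmentation formats are easy to reproduce; a minimal PIL sketch follows, with layout and sizes as arbitrary illustrative choices rather than the paper's exact settings.

```python
# Minimal sketch of grid-collage and pic-in-pic augmentation with PIL
# (layout and sizes are illustrative, not MIA-DPO's exact settings).
from PIL import Image

def grid_collage(images, cols=2, cell=(256, 256)):
    rows = (len(images) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, im in enumerate(images):
        canvas.paste(im.resize(cell),
                     ((i % cols) * cell[0], (i // cols) * cell[1]))
    return canvas

def pic_in_pic(base, inset, scale=0.3, margin=10):
    w, h = base.size
    small = inset.resize((int(w * scale), int(h * scale)))
    out = base.copy()
    out.paste(small, (w - small.width - margin, margin))  # top-right corner
    return out

tiles = [Image.new("RGB", (64, 64), c) for c in ("red", "green", "blue", "gray")]
collage = grid_collage(tiles)               # 2x2 grid of unrelated images
augmented = pic_in_pic(collage, tiles[0])   # small inset in the corner
```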
arXiv Detail & Related papers (2024-10-23T07:56:48Z)
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, a curated subset of these candidate views is fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
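The curation step between the two stages is not specified above; one plausible instantiation (an assumption, not Flex3D's actual rule) is greedy selection that trades per-view quality against angular coverage:

```python
# Greedy view curation sketch: pick views balancing a quality score against
# angular diversity (illustrative criterion, not Flex3D's exact rule).
import numpy as np

def curate_views(azimuths_deg, quality, k=4, diversity_weight=0.5):
    """azimuths_deg, quality: (N,) arrays. Returns indices of k kept views."""
    chosen = [int(np.argmax(quality))]
    while len(chosen) < k:
        # Angular distance of each candidate to its nearest chosen view.
        diff = np.abs(azimuths_deg[:, None] - azimuths_deg[chosen][None, :])
        gap = np.minimum(diff, 360 - diff).min(axis=1)
        score = quality + diversity_weight * gap
        score[chosen] = -np.inf                 # never re-pick a chosen view
        chosen.append(int(np.argmax(score)))
    return chosen

print(curate_views(np.array([0., 10., 90., 180., 270.]),
                   np.array([0.9, 0.8, 0.6, 0.7, 0.5]), k=3))
```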
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model [15.936267489962122]
We propose a novel method for object insertion in 3D content represented by Gaussian Splatting.
Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained stable video diffusion model.
Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation.
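ControlNet-based injection conventionally routes the condition through zero-initialized layers so that training starts from the unmodified frozen backbone. A minimal sketch of that pattern, with placeholder shapes rather than MVInpainter's actual modules:

```python
# Zero-initialized residual injection in the ControlNet style
# (placeholder shapes; not MVInpainter's actual modules).
import torch
import torch.nn as nn

class ZeroConvInjector(nn.Module):
    def __init__(self, cond_ch, feat_ch):
        super().__init__()
        self.encode = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)
        self.zero_conv = nn.Conv2d(feat_ch, feat_ch, 1)
        nn.init.zeros_(self.zero_conv.weight)  # residual starts at exactly 0,
        nn.init.zeros_(self.zero_conv.bias)    # so the frozen base is unchanged

    def forward(self, backbone_feat, condition):
        return backbone_feat + self.zero_conv(self.encode(condition))

feat = torch.randn(1, 64, 32, 32)     # a frozen UNet block's features
cond = torch.randn(1, 3, 32, 32)      # conditioning image / control signal
print(ZeroConvInjector(3, 64)(feat, cond).allclose(feat))  # True at init
```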
arXiv Detail & Related papers (2024-09-25T13:52:50Z)
- MVHuman: Tailoring 2D Diffusion with Multi-view Sampling for Realistic 3D Human Generation [45.88714821939144]
We present an alternative scheme named MVHuman to generate human radiance fields from text guidance.
Our core is a multi-view sampling strategy to tailor the denoising processes of the pre-trained network for generating consistent multi-view images.
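Tailoring the denoising processes across views amounts to synchronizing per-view trajectories at each step. The toy below blends each view's latent toward the cross-view mean; the blend weight, and the assumption that latents are already aligned to a shared parameterization, are illustrative simplifications rather than MVHuman's exact procedure.

```python
# Toy cross-view synchronization step: after each denoising update, nudge
# every view's latent toward its neighbors (alignment assumed done upstream;
# illustrative only, not MVHuman's exact sampling procedure).
import torch

def synchronize(latents, weight=0.2):
    """latents: (V, C, H, W) per-view latents assumed mutually aligned.
    Blends each view toward the cross-view mean."""
    mean = latents.mean(dim=0, keepdim=True)
    return (1 - weight) * latents + weight * mean

latents = torch.randn(4, 4, 32, 32)   # 4 views
latents = synchronize(latents)        # call once per denoising step
```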
arXiv Detail & Related papers (2023-12-15T11:56:26Z)
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion [60.30030562932703]
EpiDiff is a localized interactive multiview diffusion model that generates 16 multiview images in just 12 seconds and surpasses previous methods on quality evaluation metrics.
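The epipolar constraint means a pixel in one view attends only to a band around its epipolar line in another view. The mask construction below is standard two-view geometry (the fundamental matrix and band width here are illustrative values):

```python
# Epipolar band mask: for a pixel x in view A, attention in view B is
# limited to points near the epipolar line l = F x (standard geometry;
# the threshold tau is an arbitrary illustrative choice).
import numpy as np

def epipolar_mask(F_mat, x, H, W, tau=2.0):
    """F_mat: (3,3) fundamental matrix A->B; x: (u, v) pixel in view A.
    Returns (H, W) bool mask of view-B pixels within tau px of the line."""
    line = F_mat @ np.array([x[0], x[1], 1.0])         # (a, b, c)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    dist = np.abs(line[0] * u + line[1] * v + line[2])
    dist /= np.hypot(line[0], line[1])
    return dist < tau

F_mat = np.array([[0., 0., -0.01],     # skew-symmetric toy fundamental matrix
                  [0., 0., 0.02],
                  [0.01, -0.02, 0.]])
mask = epipolar_mask(F_mat, (40, 30), H=64, W=64)
print(mask.sum(), "of", mask.size, "pixels attended")
```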
arXiv Detail & Related papers (2023-12-11T05:20:52Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
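Direct regression here means learnable person queries are decoded against image features and each query emits a full 3D joint set, with no intermediate heatmaps or detections. A toy version of that interface, with placeholder dimensions rather than MvP's actual design:

```python
# Toy query-based direct pose regression (placeholder dimensions;
# not the actual MvP architecture).
import torch
import torch.nn as nn

class DirectPoseRegressor(nn.Module):
    def __init__(self, d=128, num_persons=10, num_joints=15):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_persons, d))
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.pose_head = nn.Linear(d, num_joints * 3)   # (x, y, z) per joint
        self.score_head = nn.Linear(d, 1)               # person confidence

    def forward(self, feats):                           # feats: (B, T, d)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        h = self.decoder(q, feats)                      # (B, P, d)
        poses = self.pose_head(h).view(*h.shape[:2], -1, 3)
        return poses, self.score_head(h).squeeze(-1)

poses, scores = DirectPoseRegressor()(torch.randn(2, 196, 128))
print(poses.shape)  # (2, 10, 15, 3): batch, persons, joints, xyz
```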
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
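A standard recipe for pixel-aligned implicit reconstruction from sparse views, assumed here for illustration, is to project each 3D query point into every view, sample image features at the projections, and decode occupancy with an MLP:

```python
# Pixel-aligned implicit occupancy sketch (a standard recipe assumed for
# illustration; feature extraction and camera projection are stubbed).
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_occupancy(feat_maps, uv, mlp):
    """feat_maps: (V, C, H, W) per-view features; uv: (V, P, 2) projections
    of P query points into each view, normalized to [-1, 1]."""
    sampled = F.grid_sample(feat_maps, uv.unsqueeze(2),
                            align_corners=True)         # (V, C, P, 1)
    fused = sampled.squeeze(-1).mean(dim=0).t()         # (P, C): avg over views
    return torch.sigmoid(mlp(fused)).squeeze(-1)        # (P,) occupancy in [0,1]

V, C, P = 3, 32, 100
mlp = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, 1))
occ = query_occupancy(torch.randn(V, C, 16, 16),
                      torch.rand(V, P, 2) * 2 - 1, mlp)
print(occ.shape)  # torch.Size([100])
```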
arXiv Detail & Related papers (2021-04-16T10:20:26Z)