Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
- URL: http://arxiv.org/abs/2603.01594v1
- Date: Mon, 02 Mar 2026 08:23:36 GMT
- Title: Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
- Authors: Jiaqi Leng, Shuyuan Tu, Haidong Cao, Sicheng Xie, Daoguo Dong, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: Preference Score Distillation (PSD) is an optimization-based framework for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients. We introduce an adaptive strategy to co-optimize preference scores and negative text embeddings.
- Score: 69.34278282513593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: because reward models are never trained on noisy samples, directly applying 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs with naive classifier guidance in conditional diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of the negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory within a score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
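The abstract does not give PSD's exact loss, but the CFG-style mechanism it builds on can be sketched. The following is a minimal illustrative sketch, not the paper's implementation: the function names, toy array shapes, and the use of random stand-ins for the diffusion model's noise predictions are all assumptions. It shows the standard CFG combination of a conditional and a negative-prompt prediction (where PSD would make the negative embedding learnable) feeding a score-distillation-style residual gradient.

```python
import numpy as np

def cfg_guided_score(eps_cond, eps_neg, guidance_scale):
    """Classifier-free guidance: steer the noise prediction away from the
    negative-prompt prediction and toward the conditional one."""
    return eps_neg + guidance_scale * (eps_cond - eps_neg)

def sds_style_grad(eps_guided, eps_added, weight=1.0):
    """Score-distillation-style gradient on the rendering: the weighted
    residual between the guided prediction and the noise actually added."""
    return weight * (eps_guided - eps_added)

# Toy stand-ins for the diffusion model's noise predictions.
rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((4, 4))   # prediction under the text prompt
eps_neg = rng.standard_normal((4, 4))    # prediction under the (learnable) negative embedding
eps_added = rng.standard_normal((4, 4))  # noise added to the rendered view

guided = cfg_guided_score(eps_cond, eps_neg, guidance_scale=7.5)
grad = sds_style_grad(guided, eps_added)
```

In PSD's setting, `eps_neg` would come from the co-optimized negative text embedding rather than a fixed negative prompt, so the guidance direction itself adapts during optimization.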
Related papers
- UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning [6.502142457981839]
3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives.
arXiv Detail & Related papers (2025-12-31T10:20:01Z)
- Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation [10.863222482923605]
We propose a linearized variant of the model for score distillation, giving rise to Linearized Lookahead Variational Score Distillation ($L2$-VSD). $L2$-VSD can be realized efficiently with the forward-mode autodiff functionalities of existing deep learning libraries. We also show that our method can be seamlessly incorporated into any other VSD-based text-to-3D framework.
arXiv Detail & Related papers (2025-07-13T18:57:45Z)
- Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
- Adapting Human Mesh Recovery with Vision-Language Feedback [17.253535686451897]
We leverage vision-language models to generate interactive body part descriptions. We train a text encoder and a pose VQ-VAE, aligning texts to body poses in a shared latent space. The model can produce poses with accurate 3D perception and image consistency.
arXiv Detail & Related papers (2025-02-06T07:42:00Z)
- A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision [65.33043028101471]
We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision. Most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets.
arXiv Detail & Related papers (2024-12-01T00:29:57Z)
- ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions [91.55655961014027]
3D semantic occupancy and flow prediction are fundamental to scene understanding. This paper proposes a vision-based framework with three targeted improvements. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy and joint semantic-flow prediction.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models [59.13757801286343]
Few-shot class-incremental learning aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. We introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise.
arXiv Detail & Related papers (2023-12-28T14:52:07Z)
- Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution [28.526714129927093]
We propose DreamPortrait, which aims to generate text-guided 3D-aware portraits in a single forward pass for efficiency.
We further design a 3D-aware gated cross-attention mechanism to explicitly let the model perceive the correspondence between the text and the 3D-aware space.
arXiv Detail & Related papers (2023-06-03T11:08:38Z)
- HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance [19.252300247300145]
This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation.
We compute denoising scores in the text-to-image diffusion model's latent and image spaces.
To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays.
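The z-coordinate variance regularization mentioned above can be sketched as follows. This is an illustrative sketch, not HiFA's implementation: the function name, the per-ray array layout, and the normalization details are assumptions. Penalizing the weighted variance of sample depths along each NeRF ray encourages density to concentrate at a single surface.

```python
import numpy as np

def z_variance_loss(z, w, eps=1e-8):
    """Per-ray variance of sample depths z under the ray's compositing
    weights w; minimizing it encourages opaque, well-localized surfaces.

    z, w: arrays of shape (num_rays, samples_per_ray).
    """
    w = w / (w.sum(axis=-1, keepdims=True) + eps)  # normalize weights per ray
    mean_z = (w * z).sum(axis=-1, keepdims=True)   # expected depth per ray
    return (w * (z - mean_z) ** 2).sum(axis=-1)    # weighted depth variance
```

A ray whose weight is concentrated on one sample yields zero loss, while weights spread across many depths (a "foggy" reconstruction) are penalized.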
arXiv Detail & Related papers (2023-05-30T05:56:58Z)
- Learned Vertex Descent: A New Direction for 3D Human Model Fitting [64.04726230507258]
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans.
Our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art.
LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.
arXiv Detail & Related papers (2022-05-12T17:55:51Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.