SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis
- URL: http://arxiv.org/abs/2602.20079v1
- Date: Mon, 23 Feb 2026 17:45:21 GMT
- Title: SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis
- Authors: Xinya Chen, Christopher Wewer, Jiahao Xie, Xinting Hu, Jan Eric Lenssen,
- Abstract summary: We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS). Existing NVS methods generate semantically implausible and distorted images under long-range camera motion. We propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning, achieving high-quality generation even at distant viewpoints.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS) that improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view; however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation occurs because current models fail to fully understand their conditioning or the intermediate generated scene content. We therefore propose integrating pre-trained semantic feature extractors to supply stronger scene semantics as conditioning, achieving high-quality generation even at distant viewpoints. We investigate two strategies: (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate clear qualitative and quantitative (4.69%-15.26% in FID) improvements over state-of-the-art alternatives.
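The abstract's first strategy, warped semantic features as conditioning, can be sketched in miniature. The code below is a toy illustration, not the paper's implementation: `extract_semantic_features` stands in for a frozen pre-trained extractor, the flow field and the `denoise_step` blend are invented for illustration, and all function names are hypothetical. It only shows the data flow: extract features from the input view, backward-warp them into the target view, then condition each denoising step on the warped map.

```python
import numpy as np

def extract_semantic_features(image, dim=8):
    """Stand-in for a frozen pre-trained semantic extractor:
    a fixed random linear projection of RGB to a feature grid."""
    rng = np.random.default_rng(0)          # fixed seed plays the role of frozen weights
    proj = rng.standard_normal((3, dim))
    return image @ proj                      # (h, w, dim) feature map

def warp_features(feats, flow):
    """Nearest-neighbour backward warp: for each target pixel,
    `flow` gives the (dy, dx) offset of its source pixel."""
    h, w, _ = feats.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return feats[src_y, src_x]

def denoise_step(latent, warped_feats, t):
    """Toy denoiser: conditions on the warped semantic features by
    concatenating them channel-wise before a fixed update rule."""
    cond = np.concatenate([latent, warped_feats], axis=-1)
    return latent - 0.1 * t * cond.mean(axis=-1, keepdims=True)

# Strategy (1) in miniature: warp once, then condition every denoising step.
image = np.random.default_rng(1).random((16, 16, 3))
feats = extract_semantic_features(image)
flow = np.full((16, 16, 2), 2.0)             # a constant 2-pixel camera-induced shift
warped = warp_features(feats, flow)

latent = np.random.default_rng(2).standard_normal((16, 16, 1))
for t in (3, 2, 1):                          # a few denoising steps
    latent = denoise_step(latent, warped, t)
print(latent.shape)
```

The second strategy described in the abstract would differ only in the loop: instead of warping once up front, an understanding pass would re-extract features from the intermediate latent at each denoising step before the generation pass.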
Related papers
- BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model [3.7515646463759698]
We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a strong backbone. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2026-02-26T03:58:42Z) - AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation [48.47444428530136]
Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. Existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image. We introduce AlignVid, a training-free framework with two components: Attention Scaling Modulation (ASM) and Guidance Scheduling (GS).
arXiv Detail & Related papers (2025-12-01T06:53:48Z) - CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model [50.93869080795228]
Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities. We present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion.
arXiv Detail & Related papers (2025-11-17T08:20:06Z) - VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis [23.50866105623598]
We propose a diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both the noise and latent levels. Our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
arXiv Detail & Related papers (2025-09-28T03:17:58Z) - SFLD: Reducing the content bias for AI-generated Image Detection [23.152346805893373]
Current benchmarks face challenges such as low image quality, insufficient content preservation, and limited class diversity. In response, we introduce Twin Synths, a new benchmark generation methodology that constructs visually near-identical pairs of real and synthetic images. Our novel approach, SFLD, incorporates PatchShuffle to integrate high-level semantic and low-level textural information.
arXiv Detail & Related papers (2025-02-24T12:38:34Z) - NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images [50.36605863731669]
NVComposer is a novel approach that eliminates the need for explicit external alignment, and it achieves state-of-the-art performance in generative multi-view NVS tasks. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases.
arXiv Detail & Related papers (2024-12-04T17:58:03Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models [16.326276673056334]
Consistent-1-to-3 is a generative framework that significantly mitigates the inconsistency of novel views synthesized from a single image.
We decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions.
We propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information.
arXiv Detail & Related papers (2023-10-04T17:58:57Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPMs for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.