TOSS: High-quality Text-guided Novel View Synthesis from a Single Image
- URL: http://arxiv.org/abs/2310.10644v1
- Date: Mon, 16 Oct 2023 17:59:09 GMT
- Title: TOSS: High-quality Text-guided Novel View Synthesis from a Single Image
- Authors: Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang,
Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum
- Abstract summary: We present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image.
To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space.
- Score: 36.90122394242858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present TOSS, which introduces text to the task of novel
view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has
demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a
pure image-to-image translation problem. This approach suffers from the
severely under-constrained nature of single-view NVS: the process lacks any
means of explicit user control and often yields implausible generations.
To address this limitation, TOSS uses text as high-level semantic
information to constrain the NVS solution space. TOSS fine-tunes text-to-image
Stable Diffusion pre-trained on large-scale text-image pairs and introduces
modules specifically tailored to image and camera pose conditioning, as well as
dedicated training for pose correctness and preservation of fine details.
Comprehensive experiments are conducted with results showing that our proposed
TOSS outperforms Zero-1-to-3 with more plausible, controllable and
multiview-consistent NVS results. We further support these results with
comprehensive ablations that underscore the effectiveness and potential of the
introduced semantic guidance and architecture design.
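The abstract sketches the recipe at a high level: start from text-to-image Stable Diffusion, add image and camera-pose conditioning modules, and use text to narrow the NVS solution space. The paper's actual modules are not reproduced here; the following is a minimal, illustrative PyTorch sketch of one common way such dual conditioning is combined at sampling time via classifier-free guidance. Every name in it (`PoseEmbed`, `guided_noise`, the `unet(x, t, cond)` interface, the 4-scalar pose layout, and the guidance scales) is an assumption for illustration, not TOSS's implementation.

```python
import torch
import torch.nn as nn

class PoseEmbed(nn.Module):
    """Hypothetical module: maps a relative camera pose (here 4 scalars,
    e.g. delta elevation/azimuth/radius with a sin/cos split) to a single
    conditioning token of width `dim`."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # (B, 4) -> (B, 1, dim): one extra token alongside text/image tokens
        return self.mlp(pose).unsqueeze(1)

@torch.no_grad()
def guided_noise(unet, x_t, t, text_tok, img_tok, pose_tok,
                 s_txt: float = 7.5, s_img: float = 3.0) -> torch.Tensor:
    """Dual classifier-free guidance: one scale for the image+pose condition,
    a second for the text condition. `unet(x, t, cond)` is an assumed
    noise-prediction interface; all token tensors are (B, L, dim)."""
    null_txt, null_img, null_pose = (torch.zeros_like(z)
                                     for z in (text_tok, img_tok, pose_tok))
    c_full = torch.cat([text_tok, img_tok, pose_tok], dim=1)    # text + image + pose
    c_img  = torch.cat([null_txt, img_tok, pose_tok], dim=1)    # image + pose only
    c_none = torch.cat([null_txt, null_img, null_pose], dim=1)  # unconditional
    e_full = unet(x_t, t, c_full)
    e_img  = unet(x_t, t, c_img)
    e_none = unet(x_t, t, c_none)
    # Compose guidance: unconditional + image/pose direction + text direction
    return e_none + s_img * (e_img - e_none) + s_txt * (e_full - e_img)
```

In a full sampler, `guided_noise` would replace the plain noise prediction inside a standard DDIM/DDPM loop; dropping the text tokens recovers a Zero-1-to-3-style image+pose-only model, which is the baseline the abstract compares against.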
Related papers
- CheapNVS: Real-Time On-Device Narrow-Baseline Novel View Synthesis [2.4578723416255754]
Single-view novel view synthesis (NVS) is a notoriously difficult problem due to its ill-posed nature, and often requires large, computationally expensive models to produce tangible results.
We propose CheapNVS: a fully end-to-end approach for narrow baseline single-view NVS based on a novel, efficient multiple encoder/decoder design trained in a multi-stage fashion.
arXiv Detail & Related papers (2025-01-24T14:40:39Z)
- Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [70.98890307376548]
We propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate unfaithful content during training.
Our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning.
arXiv Detail & Related papers (2024-12-31T13:39:08Z)
- Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription [23.517622316025772]
Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS).
Existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances.
We propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space.
arXiv Detail & Related papers (2024-03-16T15:39:23Z)
- Novel View Synthesis with View-Dependent Effects from a Single Image [35.85973300177698]
We are the first to consider view-dependent effects in single-image-based novel view synthesis (NVS).
We propose to exploit the camera motion priors in NVS to model view-dependent appearance or effects (VDE) as the negative disparity in the scene.
We present extensive experimental results showing that our proposed method can learn NVS with VDEs, outperforming SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets.
arXiv Detail & Related papers (2023-12-13T11:29:47Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Implicit Neural Representation for Cooperative Low-light Image Enhancement [10.484180571326565]
We propose an implicit Neural Representation method for Cooperative low-light image enhancement, dubbed NeRCo.
NeRCo unifies the diverse degradation factors of real-world scenes with a controllable fitting function, leading to better robustness.
For the output, we introduce semantically oriented supervision with priors from a pre-trained vision-language model.
arXiv Detail & Related papers (2023-03-21T10:24:29Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- Hyperspectral Image Super-resolution via Deep Progressive Zero-centric Residual Learning [62.52242684874278]
Cross-modality distribution of spatial and spectral information makes the problem challenging.
We propose a novel lightweight deep neural network-based framework, namely PZRes-Net.
Our framework learns a high-resolution, zero-centric residual image, which contains the high-frequency spatial details of the scene (a minimal sketch of this zero-centric constraint appears after this list).
arXiv Detail & Related papers (2020-06-18T06:32:11Z)
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real-scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
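As forward-referenced in the PZRes-Net entry above, that paper's blurb names one concrete constraint: the learned residual is zero-centric, so it can only add high-frequency detail on top of the upsampled low-resolution input. Below is a minimal PyTorch sketch of that single idea, under the assumption that "zero-centric" means per-channel re-centering; the function names and the bicubic upsampling choice are hypothetical, and the actual network, multispectral guidance, and losses are not shown.

```python
import torch
import torch.nn.functional as F

def zero_centric(residual: torch.Tensor) -> torch.Tensor:
    """Re-center a predicted residual to zero mean per channel, so it can only
    carry high-frequency detail without shifting low-frequency content."""
    return residual - residual.mean(dim=(-2, -1), keepdim=True)

def reconstruct_hr(lr_hsi: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """HR estimate = spatially upsampled LR hyperspectral image (B, C, h, w)
    plus a zero-centric residual; in the real method the residual would come
    from a learned network."""
    up = F.interpolate(lr_hsi, size=residual.shape[-2:],
                       mode="bicubic", align_corners=False)
    return up + zero_centric(residual)
```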
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.