MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
- URL: http://arxiv.org/abs/2309.13607v3
- Date: Fri, 24 Jan 2025 16:37:45 GMT
- Title: MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
- Authors: Zijiang Yang, Zhongwei Qiu, Chang Xu, Dongmei Fu
- Abstract summary: We propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF. MM-NeRF projects multimodal guidance into a unified space to keep multimodal styles consistent and extracts multimodal features to guide the 3D stylization. Experiments on several real-world datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance.
- Score: 23.050381521558414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D style transfer aims to generate stylized views of 3D scenes with specified styles, which requires both high-quality generation and multi-view consistency. Existing methods still struggle with high-quality stylization that preserves texture details and with stylization under multimodal guidance. In this paper, we reveal that the common training method of stylization with NeRF, which generates stylized multi-view supervision by 2D style transfer models, causes the same object in the supervision to show various states (color tone, details, etc.) in different views. This leads NeRF to smooth out texture details, which in turn results in low-quality rendering for 3D multi-style transfer. To tackle these problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF. First, MM-NeRF projects multimodal guidance into a unified space to keep multimodal styles consistent and extracts multimodal features to guide the 3D stylization. Second, a novel multi-head learning scheme is proposed to relieve the difficulty of learning multi-style transfer, and a multi-view style consistent loss is proposed to track the inconsistency of multi-view supervision data. Finally, a novel incremental learning mechanism is proposed to generalize MM-NeRF to any new style at a small cost. Extensive experiments on several real-world datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance, and keeps both multi-view consistency and style consistency across multimodal guidance.
Related papers
- Multi-level Dynamic Style Transfer for NeRFs [40.439070690681]
MDS-NeRF is a novel approach that reengineers the NeRF pipeline specifically for stylization. We propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field. We also present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns.
arXiv Detail & Related papers (2025-10-01T07:19:27Z) - MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning [12.821814562210632]
This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach. We propose a new Invertible Cross-Attention layer to develop the Normalizing Flow-based Model for multimodal data. We also introduce three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA).
arXiv Detail & Related papers (2025-08-13T18:56:57Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method under MM-DiT architectures.
We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models.
Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement [66.8116563135326]
We present 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency.
Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles.
arXiv Detail & Related papers (2024-12-24T17:36:34Z) - Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning [12.43848969320173]
Stylized images from different viewpoints generated by our method achieve superior visual quality, with better structural integrity and less distortion.
Our method effectively preserves the structural information and multi-view consistency in stylized images without any 3D information.
arXiv Detail & Related papers (2024-11-15T12:02:07Z) - G3DST: Generalizing 3D Style Transfer with Neural Radiance Fields across Scenes and Styles [45.92812062685523]
Existing methods for 3D style transfer need extensive per-scene optimization for single or multiple styles.
In this work, we overcome the limitations of existing methods by rendering stylized novel views from a NeRF without the need for per-scene or per-style optimization.
Our findings demonstrate that this approach achieves visual quality comparable to that of per-scene methods.
arXiv Detail & Related papers (2024-08-24T08:04:19Z) - Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images [54.56070204172398]
We propose a simple yet effective pipeline for stylizing a 3D scene.
We perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model.
We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality.
arXiv Detail & Related papers (2024-06-19T09:36:18Z) - Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z) - Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z) - ${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields [33.168225243348786]
We present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits.
Specifically, we use multi-modal feature distillation to integrate teacher features from pretrained visual and language models into 3D semantic feature volumes.
Experiments on various real-world scenes show superior performance in 3D scene decomposition tasks compared to prior NeRF-based methods.
arXiv Detail & Related papers (2024-05-08T12:25:21Z) - FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields [23.705795612467956]
FPRF stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization.
FPRF achieves favorable photorealistic-quality 3D scene stylization for large-scale scenes with diverse reference images.
arXiv Detail & Related papers (2024-01-10T19:27:28Z) - StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields [52.19291190355375]
StyleRF (Style Radiance Fields) is an innovative 3D style transfer technique.
It employs an explicit grid of high-level features to represent 3D scenes, with which high-fidelity geometry can be reliably restored via volume rendering.
It transforms the grid features according to the reference style, which directly leads to high-quality zero-shot style transfer.
arXiv Detail & Related papers (2023-03-19T08:26:06Z) - 3DSNet: Unsupervised Shape-to-Shape 3D Style Transfer [66.48720190245616]
We propose a learning-based approach for style transfer between 3D objects.
The proposed method can synthesize new 3D shapes both in the form of point clouds and meshes.
We extend our technique to implicitly learn the multimodal style distribution of the chosen domains.
arXiv Detail & Related papers (2020-11-26T16:59:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.