Is Attention All NeRF Needs?
- URL: http://arxiv.org/abs/2207.13298v1
- Date: Wed, 27 Jul 2022 05:09:54 GMT
- Title: Is Attention All NeRF Needs?
- Authors: Mukund Varma T, Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini
Venugopalan, Zhangyang Wang
- Abstract summary: Generalizable NeRF Transformer (GNT) is a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views.
GNT achieves generalizable neural scene representation and rendering, by encapsulating two transformer-based stages.
- Score: 103.51023982774599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Generalizable NeRF Transformer (GNT), a pure, unified
transformer-based architecture that efficiently reconstructs Neural Radiance
Fields (NeRFs) on the fly from source views. Unlike prior works on NeRF that
optimize a per-scene implicit representation by inverting a handcrafted
rendering equation, GNT achieves generalizable neural scene representation and
rendering, by encapsulating two transformer-based stages. The first stage of
GNT, called view transformer, leverages multi-view geometry as an inductive
bias for attention-based scene representation, and predicts coordinate-aligned
features by aggregating information from epipolar lines on the neighboring
views. The second stage of GNT, named ray transformer, renders novel views by
ray marching and directly decodes the sequence of sampled point features using
the attention mechanism. Our experiments demonstrate that when optimized on a
single scene, GNT can successfully reconstruct NeRF without explicit rendering
formula, and even improve the PSNR by ~1.3dB on complex scenes due to the
learnable ray renderer. When trained across various scenes, GNT consistently
achieves state-of-the-art performance when transferring to the forward-facing
LLFF dataset (LPIPS ~20%, SSIM ~25%) and the synthetic Blender dataset (LPIPS
~20%, SSIM ~4%). In addition, we show that depth and occlusion can be inferred
from the learned attention maps, which implies that the pure attention
mechanism is capable of learning a physically-grounded rendering process. All
these results bring us one step closer to the tantalizing hope of utilizing
transformers as the "universal modeling tool" even for graphics. Please refer
to our project page for video results: https://vita-group.github.io/GNT/.
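To make the two-stage design concrete, here is a minimal sketch in PyTorch of how a view transformer and a ray transformer could be chained. This is not the authors' implementation: the epipolar feature gathering is abstracted away as pre-extracted per-view features, and the module sizes, pooling, and color head are placeholder assumptions; only the overall data flow (attention across source views per 3D sample point, then attention along each ray in place of the hand-crafted volume-rendering formula) follows the abstract.

```python
# Hypothetical sketch of GNT's two attention stages (not the official code).
# Shapes: R rays, P sample points per ray, V source views, D feature channels.
import torch
import torch.nn as nn

class ViewTransformer(nn.Module):
    """Stage 1: for each sampled 3D point, fuse the features gathered from the
    source views (in GNT, along epipolar lines) into one coordinate-aligned
    feature. A single self-attention block stands in for the real aggregation."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats):                 # (R*P, V, D)
        x, _ = self.attn(view_feats, view_feats, view_feats)
        x = self.norm(x + view_feats)
        return x.mean(dim=1)                       # pool over views -> (R*P, D)

class RayTransformer(nn.Module):
    """Stage 2: attend over the sequence of point features along each ray and
    decode a color directly, in place of the explicit rendering equation."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, point_feats):                # (R, P, D)
        x, attn_w = self.attn(point_feats, point_feats, point_feats,
                              need_weights=True)
        x = self.norm(x + point_feats)
        rgb = torch.sigmoid(self.to_rgb(x.mean(dim=1)))   # (R, 3)
        return rgb, attn_w                         # attn_w: (R, P, P)

# Toy forward pass: 8 rays, 32 samples per ray, 6 source views, 64 channels.
R, P, V, D = 8, 32, 6, 64
view_feats = torch.randn(R * P, V, D)              # placeholder epipolar features
point_feats = ViewTransformer(D)(view_feats).reshape(R, P, D)
rgb, attn = RayTransformer(D)(point_feats)
print(rgb.shape, attn.shape)                       # torch.Size([8, 3]) torch.Size([8, 32, 32])
```

The per-ray attention weights returned by the second stage are one place where, as the abstract notes for the real model, depth and occlusion cues could in principle be read off, although GNT's actual attention structure is richer than this toy version.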
Related papers
- CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs [65.80187860906115]
We propose a novel approach to improve NeRF's performance with sparse inputs.
We first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space.
We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering.
arXiv Detail & Related papers (2024-03-25T15:56:17Z)
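The CVT-xRF summary above describes a concrete sampling procedure: draw rays that are guaranteed to pass through a chosen voxel, then add extra samples inside that voxel for an in-voxel Transformer to reason over. The NumPy sketch below illustrates only that sampling idea under assumed parameters (a single hand-picked voxel, toy sample counts); it is not the CVT-xRF code, and the downstream Transformer is omitted.

```python
# Hypothetical NumPy sketch of voxel-constrained ray sampling (not the CVT-xRF code).
import numpy as np

def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: return (hit, t_near, t_far) for a ray against an axis-aligned voxel."""
    inv_d = 1.0 / direction
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()
    t_far = np.maximum(t0, t1).min()
    return t_far >= max(t_near, 0.0), t_near, t_far

rng = np.random.default_rng(0)
box_min = np.array([0.4, 0.4, 0.4])                # one hand-picked voxel
box_max = np.array([0.6, 0.6, 0.6])

rays = []
for _ in range(4):
    origin = rng.uniform(-1.0, 0.0, size=3)        # camera-side ray origin
    target = rng.uniform(box_min, box_max)         # aim at a point inside the voxel
    direction = target - origin
    direction /= np.linalg.norm(direction)
    hit, t_near, t_far = ray_aabb(origin, direction, box_min, box_max)
    assert hit                                     # construction guarantees an intersection
    t_ray = np.linspace(0.0, 2.0, 32)              # regular samples along the ray
    t_voxel = rng.uniform(t_near, t_far, size=8)   # extra samples inside the voxel
    rays.append((origin, direction, np.sort(np.concatenate([t_ray, t_voxel]))))

print(len(rays), rays[0][2].shape)                 # 4 (40,)
```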
- Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts [88.23732496104667]
Cross-scene generalizable NeRF models have become a new spotlight of the NeRF field.
We bridge "neuralized" architectures with the powerful Mixture-of-Experts (MoE) idea from large language models.
Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes.
arXiv Detail & Related papers (2023-08-22T21:18:54Z)
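For readers unfamiliar with the Mixture-of-Experts idea that GNT-MOVE borrows from large language models, the sketch below shows a generic top-k token-routing MoE feed-forward layer. The expert count, gating scheme, and the suggestion that it would sit on per-point features inside GNT are illustrative assumptions, not the GNT-MOVE design.

```python
# Generic sketch of a Mixture-of-Experts feed-forward layer (not GNT-MOVE itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Each token picks its top-k experts via a learned gate; the output is the
    gate-weighted sum of the selected experts' MLP outputs."""
    def __init__(self, dim=64, num_experts=4, hidden=128, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.gate(x)                      # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1) # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens sent to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                       # e.g. per-point features inside GNT
print(MoEFeedForward()(tokens).shape)              # torch.Size([10, 64])
```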
- NeRF-SOS: Any-View Self-supervised Object Segmentation from Complex Real-World Scenes [80.59831861186227]
This paper carries out the exploration of self-supervised learning for object segmentation using NeRF for complex real-world scenes.
Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), encourages NeRF models to distill compact geometry-aware segmentation clusters.
It consistently surpasses other 2D-based self-supervised baselines and predicts finer semantic masks than existing supervised counterparts.
arXiv Detail & Related papers (2022-09-19T06:03:17Z)
- End-to-end View Synthesis via NeRF Attention [71.06080186332524]
We present a simple seq2seq formulation for view synthesis where we take a set of ray points as input and output colors corresponding to the rays.
Inspired by the neural radiance field (NeRF), we propose the NeRF attention (NeRFA) to address the above problems.
NeRFA demonstrates superior performance over NeRF and NerFormer on four datasets.
arXiv Detail & Related papers (2022-07-29T15:26:16Z)
- Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer [23.228142134527292]
We propose a Transformer-based NeRF (TransNeRF) to learn a generic neural radiance field conditioned on observed-view images.
Experiments demonstrate that our TransNeRF, trained on a wide variety of scenes, can achieve better performance in comparison to state-of-the-art image-based neural rendering methods.
arXiv Detail & Related papers (2022-06-10T23:16:43Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone that extracts a feature map from the input image and a Transformer head that models global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
- Rethinking Graph Transformers with Spectral Attention [13.068288784805901]
We present the Spectral Attention Network (SAN), which uses a learned positional encoding (LPE) to learn the position of each node in a given graph.
By leveraging the full spectrum of the Laplacian, our model is theoretically powerful in distinguishing graphs, and can better detect similar sub-structures from their resonance.
Our model performs on par or better than state-of-the-art GNNs, and outperforms any attention-based model by a wide margin.
arXiv Detail & Related papers (2021-06-07T18:11:11Z)
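As a rough illustration of the spectral idea behind SAN's positional encoding, the sketch below computes eigenvectors of the symmetric normalized graph Laplacian and uses them as per-node coordinates. SAN itself learns its positional encoding from the eigenvalues and eigenvectors rather than using them raw, so this is only the underlying ingredient, with the graph and frequency count chosen arbitrarily.

```python
# Sketch of a Laplacian-spectrum positional encoding for graph nodes
# (the general idea underlying SAN's learned positional encoding).
import numpy as np

def laplacian_positional_encoding(adj, num_freqs=4):
    """Return eigenvectors of the symmetric normalized Laplacian for the
    smallest non-trivial eigenvalues, one row of encodings per node."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    return eigvecs[:, 1:num_freqs + 1]              # skip the trivial constant mode

# 4-cycle graph: each node gets a small spectral "coordinate" vector.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
pe = laplacian_positional_encoding(adj, num_freqs=2)
print(pe.shape)                                     # (4, 2)
```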