SnakeVoxFormer: Transformer-based Single Image Voxel Reconstruction with Run Length Encoding
- URL: http://arxiv.org/abs/2303.16293v1
- Date: Tue, 28 Mar 2023 20:16:13 GMT
- Title: SnakeVoxFormer: Transformer-based Single Image Voxel Reconstruction with Run Length Encoding
- Authors: Jae Joong Lee, Bedrich Benes
- Abstract summary: SnakeVoxFormer is a novel method for 3D object reconstruction in voxel space from a single image using a transformer.
We show how different voxel traversal strategies affect the encoding and the reconstruction.
- Score: 9.691609196086015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-based 3D object reconstruction has achieved unprecedented
results. Among those, the transformer deep neural model has shown outstanding
performance in many computer vision applications. We introduce SnakeVoxFormer, a
novel method for 3D object reconstruction in voxel space from a single image using
a transformer. The input to SnakeVoxFormer is a 2D image, and the result is a 3D
voxel model. The key novelty of our approach is the use of run-length encoding that
traverses the voxel space like a snake and encodes wide spatial differences into a
1D structure suitable for transformer encoding. We then use dictionary encoding to
convert the discovered RLE blocks into tokens for the transformer. The 1D
representation is a lossless compression of the 3D shape that uses only about 1% of
the original data size. We show how different voxel traversal strategies affect the
encoding and the reconstruction. We compare our method with the state of the art
for 3D voxel reconstruction from images, and our method improves on the
state-of-the-art methods by at least 2.8% and up to 19.8%.
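The traversal-and-encoding pipeline described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration of a snake-style voxel traversal, run-length encoding, and dictionary tokenization, not the authors' implementation: the helper names (snake_flatten, run_length_encode, build_token_ids) are invented for this example, the boustrophedon ordering is only one plausible "snake" traversal, and the printed ratio is a rough proxy rather than a reproduction of the ~1% figure reported in the paper.

```python
import numpy as np

def snake_flatten(vox):
    """Flatten a (D, H, W) binary voxel grid into 1D with a snake
    (boustrophedon) traversal so consecutive entries stay spatially close:
    every other row, and the row order of every other slice, is reversed."""
    vox = np.asarray(vox)
    rows = []
    for z in range(vox.shape[0]):
        plane = vox[z] if z % 2 == 0 else vox[z][::-1]        # flip row order on odd slices
        for y in range(plane.shape[0]):
            row = plane[y] if y % 2 == 0 else plane[y][::-1]  # flip every other row
            rows.append(row)
    return np.concatenate(rows)

def run_length_encode(bits):
    """Lossless RLE of a binary sequence as a list of (value, run_length) pairs."""
    runs = []
    prev, count = int(bits[0]), 0
    for b in bits:
        if int(b) == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = int(b), 1
    runs.append((prev, count))
    return runs

def build_token_ids(runs):
    """Dictionary-encode RLE runs into integer ids usable as transformer tokens;
    here the vocabulary is simply the set of distinct (value, length) runs."""
    vocab = {run: i for i, run in enumerate(sorted(set(runs)))}
    return [vocab[r] for r in runs], vocab

# Toy usage on a 32^3 occupancy grid with a solid cube as a stand-in shape.
vox = np.zeros((32, 32, 32), dtype=np.uint8)
vox[8:24, 8:24, 8:24] = 1
seq = snake_flatten(vox)
runs = run_length_encode(seq)
tokens, vocab = build_token_ids(runs)
print(f"{seq.size} voxels -> {len(runs)} runs, vocabulary of {len(vocab)} tokens")
print(f"runs / voxels = {100.0 * len(runs) / seq.size:.2f}%")
```

Because long homogeneous runs collapse to single tokens, the flattened sequence handed to the transformer can be far shorter than the raw voxel grid, which is the property the paper exploits.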
Related papers
- SCube: Instant Large-Scale Scene Reconstruction using VoxSplats [55.383993296042526]
We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images.
Our method encodes reconstructed scenes using a novel representation VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold.
arXiv Detail & Related papers (2024-10-26T00:52:46Z) - DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction [12.408610403423559]
We propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis.
Our method utilizes an encoder-decoder framework that generates 3D Gaussians in the decoder, guided by depth-aware image features from the encoder.
We evaluate our method on the ShapeNet SRN dataset, achieving PSNR of 24.21 and 24.98 on the car and chair datasets, respectively.
arXiv Detail & Related papers (2024-04-25T04:18:59Z) - Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z) - Real-Time Radiance Fields for Single-Image Portrait View Synthesis [85.32826349697972]
We present a one-shot method to infer and render a 3D representation from a single unposed image in real-time.
Given a single RGB input, our image encoder directly predicts a canonical triplane representation of a neural radiance field for 3D-aware novel view synthesis via volume rendering.
Our method is fast (24 fps) on consumer hardware, and produces higher quality results than strong GAN-inversion baselines that require test-time optimization.
arXiv Detail & Related papers (2023-05-03T17:56:01Z) - VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion [129.5975573092919]
VoxFormer is a Transformer-based semantic scene completion framework.
It can output complete 3D semantics from only 2D images.
Our framework outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics.
arXiv Detail & Related papers (2023-02-23T18:59:36Z) - Efficient 3D Object Reconstruction using Visual Transformers [4.670344336401625]
We set out to use visual transformers in place of convolutions in 3D object reconstruction.
Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar or superior to the baseline approach.
arXiv Detail & Related papers (2023-02-16T18:33:25Z) - Cats: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer, and pass it to the decoder to obtain the results.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
arXiv Detail & Related papers (2022-08-24T14:25:11Z) - AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation [19.53151547706724]
Transformer-based models have drawn attention to exploring these techniques in medical image segmentation.
We propose Axial Fusion Transformer UNet (AFTer-UNet), which takes both advantages of convolutional layers' capability of extracting detailed features and transformers' strength on long sequence modeling.
It has fewer parameters and takes less GPU memory to train than the previous transformer-based models.
arXiv Detail & Related papers (2021-10-20T06:47:28Z) - Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from Single and Multiple Images [56.652027072552606]
We propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++.
By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
arXiv Detail & Related papers (2020-06-22T13:48:09Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.