CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
- URL: http://arxiv.org/abs/2203.09887v1
- Date: Fri, 18 Mar 2022 11:50:25 GMT
- Title: CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
- Authors: Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang
- Abstract summary: We propose CodedVTR (Codebook-based Voxel TRansformer), which improves data efficiency and generalization for 3D sparse voxel transformers.
On the one hand, we propose codebook-based attention, which projects the attention space into a subspace represented by combinations of "prototypes" in a learnable codebook.
On the other hand, we propose geometry-aware self-attention that utilizes geometric information (geometric pattern, density) to guide attention learning.
- Score: 22.39628991021092
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers have gained much attention by outperforming convolutional neural
networks in many 2D vision tasks. However, they are known to have
generalization problems and rely on massive-scale pre-training and
sophisticated training techniques. When applied to 3D tasks, the irregular
data structure and limited data scale add to the difficulty of applying
transformers. We propose CodedVTR (Codebook-based Voxel TRansformer), which
improves data efficiency and generalization ability for 3D sparse voxel
transformers. On the one hand, we propose codebook-based attention, which
projects the attention space into a subspace represented by combinations of
"prototypes" in a learnable codebook. This regularizes attention learning and
improves generalization. On the other hand, we propose geometry-aware
self-attention that utilizes geometric information (geometric pattern, density)
to guide attention learning. CodedVTR can be embedded into existing sparse
convolution-based methods and brings consistent performance improvements on
indoor and outdoor 3D semantic segmentation tasks.
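
Since the codebook-based attention is the paper's central mechanism, a compact illustration may help: rather than computing free-form attention weights, the network predicts mixing coefficients over a small set of learnable prototype attention patterns. The following is a minimal PyTorch sketch of that idea, not the paper's implementation; the names (`CodebookAttention`, `num_prototypes`, `num_neighbors`) are illustrative, and the geometry-aware guidance (conditioning prototype selection on geometric pattern and density) is rendered as an optional, hypothetical `geo_feats` input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookAttention(nn.Module):
    """Illustrative codebook-based attention: the attention map over a voxel's
    M neighbors is restricted to a convex combination of K learnable
    "prototype" patterns, which regularizes attention learning."""
    def __init__(self, dim, geo_dim=0, num_prototypes=8, num_neighbors=27):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_prototypes, num_neighbors))
        # Geometry-aware guidance (hypothetical rendering): concatenating a
        # descriptor of the local geometric pattern/density lets geometry
        # steer which prototypes are selected.
        self.to_coeff = nn.Linear(dim + geo_dim, num_prototypes)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, neighbor_feats, geo_feats=None):
        # x: (N, dim) center voxel features
        # neighbor_feats: (N, M, dim); geo_feats: optional (N, geo_dim)
        query = x if geo_feats is None else torch.cat([x, geo_feats], dim=-1)
        coeff = F.softmax(self.to_coeff(query), dim=-1)   # (N, K) mixing weights
        protos = F.softmax(self.codebook, dim=-1)         # (K, M) prototype patterns
        attn = coeff @ protos                             # (N, M) constrained attention
        return torch.einsum('nm,nmd->nd', attn, self.to_v(neighbor_feats))
```

Constraining the attention map to the span of a few prototypes shrinks the hypothesis space, which is the stated route to better data efficiency and generalization on small 3D datasets.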
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality.
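A minimal sketch of how such a 3D-to-2D virtual projection could look, assuming a simple orthographic projection onto an axis-aligned plane and a ViT-style patch grid; the function name and parameters are illustrative assumptions, not Any2Point's actual API.

```python
import torch

def virtual_project_2d(points, plane='xy', grid_size=14, img_size=224):
    """Illustrative 3D-to-2D virtual projection: map 3D points to 2D token
    positions so a frozen 2D model's positional embeddings can be reused."""
    axes = {'xy': [0, 1], 'xz': [0, 2], 'yz': [1, 2]}[plane]
    coords = points[..., axes]                              # (N, 2) drop one axis
    # Normalize to [0, 1] per cloud, then scale to the 2D model's resolution.
    mins = coords.min(dim=0, keepdim=True).values
    maxs = coords.max(dim=0, keepdim=True).values
    uv = (coords - mins) / (maxs - mins + 1e-6) * (img_size - 1)
    # Convert pixel positions to patch-grid indices of a ViT-style encoder.
    patch_idx = (uv / (img_size / grid_size)).long().clamp(max=grid_size - 1)
    return patch_idx[:, 0] * grid_size + patch_idx[:, 1]    # flat token index

# Usage: look up the frozen 2D positional embedding for each 3D point, e.g.
# point_pos = pos_embed[virtual_project_2d(points)]   # pos_embed: (G*G, dim)
```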
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Dual Octree Graph Networks for Learning Adaptive Volumetric Shape Representations [21.59311861556396]
Our method encodes the volumetric field of a 3D shape with an adaptive feature volume organized by an octree.
An encoder-decoder network is designed to learn the adaptive feature volume based on the graph convolutions over the dual graph of octree nodes.
Our method effectively encodes shape details, enables fast 3D shape reconstruction, and generalizes well to 3D shapes outside the training categories.
arXiv Detail & Related papers (2022-05-05T17:56:34Z)
- Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to use the powerful self-attention mechanism to perceive geometric inconsistencies between the given meshes.
We propose a novel geometry-contrastive Transformer that efficiently perceives global geometric inconsistencies in 3D structures.
We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z)
- PatchFormer: A Versatile 3D Transformer Based on Patch Attention [0.358439716487063]
We introduce patch-attention to adaptively learn a much smaller set of bases upon which the attention maps are computed.
By a weighted summation over these bases, patch-attention not only captures the global shape context but also achieves linear complexity in the input size.
Our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
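A minimal PyTorch sketch of this basis-attention idea, assuming a soft assignment of points to bases; the class name, shapes, and layers are illustrative, not PatchFormer's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttention(nn.Module):
    """Toy basis-attention: attend over M learned bases instead of all N
    points, so the cost is O(N * M) rather than O(N^2)."""
    def __init__(self, dim, num_bases=32):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.base_logits = nn.Linear(dim, num_bases)   # soft-assign points to bases

    def forward(self, x):                               # x: (B, N, dim)
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        assign = F.softmax(self.base_logits(x), dim=1)  # (B, N, M), sums to 1 over points
        # Aggregate keys/values into M bases by weighted summation over N points.
        k_b = torch.einsum('bnm,bnd->bmd', assign, k)   # (B, M, dim)
        v_b = torch.einsum('bnm,bnd->bmd', assign, v)
        attn = F.softmax(q @ k_b.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v_b                               # (B, N, dim) global context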
arXiv Detail & Related papers (2021-10-30T08:39:55Z)
- The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization [8.424405898986118]
We propose two modifications to the Transformer architecture, copy gate and geometric attention.
Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task.
NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing.
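The copy gate is the simpler of the two modifications and can be sketched in a few lines; this is an illustrative PyTorch rendering under assumed shapes, not the NDR reference code.

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Toy copy gate: lets a transformer layer pass its input through
    unchanged unless the gate opens, aiding systematic generalization."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x, transformed):
        # x: layer input; transformed: output of the attention/FFN sub-layer
        g = self.gate(x)                        # per-dimension gate in (0, 1)
        return g * transformed + (1.0 - g) * x  # g ~ 0 copies the input verbatim
```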
arXiv Detail & Related papers (2021-10-14T21:24:27Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes [86.2129580231191]
Adjoint Rigid Transform (ART) Network is a neural module which can be integrated with a variety of 3D networks.
ART learns to rotate input shapes to a learned canonical orientation, which is crucial for many tasks.
We will release our code and pre-trained models for further research.
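A minimal sketch of the canonicalization idea, assuming the rotation is obtained by projecting a predicted 3x3 matrix onto SO(3) via SVD; the module name and architecture are illustrative assumptions, not ART's released code.

```python
import torch
import torch.nn as nn

class CanonicalRotation(nn.Module):
    """Toy ART-style module: predict a rotation that maps an input point
    cloud to a learned canonical orientation for a downstream 3D network."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 9))

    def forward(self, points):                               # points: (B, N, 3)
        m = self.encoder(points).mean(dim=1).view(-1, 3, 3)  # pooled 3x3 prediction
        # Project onto SO(3) with SVD so the output is a proper rotation.
        u, _, vt = torch.linalg.svd(m)
        det = torch.det(u @ vt)                              # fix reflections (det = -1)
        d = torch.diag_embed(torch.stack(
            [torch.ones_like(det), torch.ones_like(det), det], dim=-1))
        rot = u @ d @ vt                                     # (B, 3, 3)
        return points @ rot.transpose(1, 2)                  # rotated to canonical pose
```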
arXiv Detail & Related papers (2021-02-01T20:58:45Z)
- Gram Regularization for Multi-view 3D Shape Retrieval [3.655021726150368]
We propose a novel regularization term called Gram regularization.
By forcing the variance between weight kernels to be large, the regularizer can help to extract discriminative features.
The proposed Gram regularization is data independent and can converge stably and quickly without bells and whistles.
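One plausible reading of such a penalty is to normalize the flattened weight kernels and push their pairwise (Gram) similarities toward zero, making kernels dissimilar; the sketch below follows that reading and is illustrative, not the paper's exact formulation.

```python
import torch

def gram_regularizer(weight):
    """Toy Gram-style penalty: discourage weight kernels from being similar,
    pushing the layer toward more discriminative (decorrelated) features."""
    w = weight.view(weight.shape[0], -1)        # (out_channels, rest), one row per kernel
    w = torch.nn.functional.normalize(w, dim=1)
    gram = w @ w.t()                            # (C, C) pairwise kernel similarities
    off_diag = gram - torch.eye(w.shape[0], device=w.device)
    return off_diag.pow(2).sum()                # small when kernels are dissimilar

# Usage: loss = task_loss + lam * sum(gram_regularizer(m.weight)
#                                     for m in model.modules()
#                                     if isinstance(m, torch.nn.Conv2d))
```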
arXiv Detail & Related papers (2020-11-16T05:37:24Z)