CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
- URL: http://arxiv.org/abs/2203.09887v1
- Date: Fri, 18 Mar 2022 11:50:25 GMT
- Title: CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
- Authors: Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang
- Abstract summary: We propose CodedVTR (Codebook-based Voxel TRansformer), which improves data efficiency and generalization for 3D sparse voxel transformers.
On the one hand, we propose codebook-based attention, which projects the attention space into a subspace spanned by combinations of "prototypes" in a learnable codebook.
On the other hand, we propose geometry-aware self-attention, which uses geometric information (geometric pattern, density) to guide attention learning.
- Score: 22.39628991021092
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers have gained much attention by outperforming convolutional neural
networks in many 2D vision tasks. However, they are known to have
generalization problems and rely on massive-scale pre-training and
sophisticated training techniques. When transformers are applied to 3D tasks, the irregular data structure and limited data scale make them even harder to train. We propose CodedVTR (Codebook-based Voxel TRansformer), which
improves data efficiency and generalization ability for 3D sparse voxel
transformers. On the one hand, we propose codebook-based attention, which projects the attention space into a subspace spanned by combinations of "prototypes" in a learnable codebook. This regularizes attention learning and
improves generalization. On the other hand, we propose geometry-aware self-attention, which uses geometric information (geometric pattern, density) to guide attention learning. CodedVTR can be embedded into existing sparse convolution-based methods and brings consistent performance improvements for indoor and outdoor 3D semantic segmentation tasks.
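To make the codebook-based attention concrete, below is a minimal PyTorch-style sketch of one way such a mechanism could be realized. The class name, tensor shapes, and the neighbor-gathering convention are assumptions for illustration, not the authors' implementation; the paper's geometry-aware variant would additionally condition the mixing coefficients on geometric pattern and density.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookAttention(nn.Module):
    """Attention restricted to convex combinations of learnable 'prototypes'.

    Instead of producing an arbitrary attention map over its K neighbors,
    each voxel only predicts mixing coefficients over M prototype maps
    stored in a shared codebook, shrinking the attention space.
    """

    def __init__(self, dim, num_neighbors, num_prototypes=8):
        super().__init__()
        # Codebook of M prototype attention logits, each over K neighbors.
        self.codebook = nn.Parameter(torch.randn(num_prototypes, num_neighbors))
        self.to_coeff = nn.Linear(dim, num_prototypes)  # feature -> coefficients
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, neighbor_feats):
        # x: (N, C) center voxel features; neighbor_feats: (N, K, C)
        coeff = F.softmax(self.to_coeff(x), dim=-1)   # (N, M) mixing weights
        attn_logits = coeff @ self.codebook           # (N, K) combined prototypes
        attn = F.softmax(attn_logits, dim=-1)         # normalize over neighbors
        v = self.to_v(neighbor_feats)                 # (N, K, C) values
        return torch.einsum('nk,nkc->nc', attn, v)    # (N, C) aggregated output
```

Because each attention map is forced to be a convex combination of a few shared prototypes, the attention space is far smaller than in free-form self-attention, which is the regularization effect the abstract describes.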
Related papers
- GaussRender: Learning 3D Occupancy with Gaussian Rendering [84.60008381280286]
GaussRender is a plug-and-play 3D-to-2D reprojection loss that enhances voxel-based supervision.
Our method projects 3D voxel representations into arbitrary 2D perspectives and leverages Gaussian splatting as an efficient, differentiable rendering proxy of voxels (a rough sketch appears below).
arXiv Detail & Related papers (2025-02-07T16:07:51Z)
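As a rough illustration of GaussRender's reprojection idea, the sketch below splats voxel occupancies onto an image plane with Gaussian kernels and compares the result against 2D supervision. The pinhole camera model, dense splatting, and L1 loss are all assumptions on my part, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reprojection_loss(centers, occ, K, target, hw=(64, 64), sigma=1.5):
    """Splat voxel occupancies onto a 2D view and compare with supervision.

    centers: (V, 3) voxel centers in camera coordinates
    occ:     (V,)   predicted occupancy in [0, 1]
    K:       (3, 3) pinhole intrinsics; target: (H, W) 2D occupancy mask
    """
    H, W = hw
    uvw = centers @ K.t()                              # project to the image plane
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)       # (V, 2) pixel coordinates
    ys = torch.arange(H, dtype=centers.dtype)
    xs = torch.arange(W, dtype=centers.dtype)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')     # (H, W) pixel grid
    # Dense Gaussian splat of every voxel onto every pixel (clear, not fast).
    d2 = (gx[None] - uv[:, 0, None, None]) ** 2 + (gy[None] - uv[:, 1, None, None]) ** 2
    weights = torch.exp(-d2 / (2 * sigma ** 2))        # (V, H, W) splat kernels
    rendered = (weights * occ[:, None, None]).sum(0).clamp(max=1.0)
    return F.l1_loss(rendered, target)                 # differentiable 2D loss
```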
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves the learning efficiency and performance of state-of-the-art transformer-based novel view synthesis (NVS) models (a rough sketch appears below).
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
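A minimal sketch of GTA-style geometry-aware attention: queries and keys are transformed by each token's own geometric transform, so their dot product depends only on the relative transformation between tokens. The per-token rotation matrices `rots` and the block-wise 3-vector layout are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def geometric_attention(q, k, v, rots):
    """Attention whose scores depend on relative geometric transforms.

    q, k, v: (N, D) token features, with D divisible by 3
    rots:    (N, 3, 3) one rotation per token (e.g. from camera poses)
    """
    N, D = q.shape
    q3 = q.view(N, D // 3, 3)                          # split into 3-vector blocks
    k3 = k.view(N, D // 3, 3)
    # Rotate each block by the token's own transform; the q.k dot product then
    # depends on R_i^T R_j, i.e. the *relative* rotation between tokens i, j.
    qg = torch.einsum('nij,nbj->nbi', rots, q3).reshape(N, D)
    kg = torch.einsum('nij,nbj->nbi', rots, k3).reshape(N, D)
    attn = F.softmax(qg @ kg.t() / D ** 0.5, dim=-1)   # (N, N) geometry-aware scores
    return attn @ v
```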
- Dual Octree Graph Networks for Learning Adaptive Volumetric Shape Representations [21.59311861556396]
Our method encodes the volumetric field of a 3D shape with an adaptive feature volume organized by an octree.
An encoder-decoder network is designed to learn the adaptive feature volume based on the graph convolutions over the dual graph of octree nodes.
Our method effectively encodes shape details, enables fast 3D shape reconstruction, and generalizes well to 3D shapes outside the training categories.
arXiv Detail & Related papers (2022-05-05T17:56:34Z)
- Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to use the powerful self-attention mechanism to perceive geometric inconsistencies between the given meshes.
We propose a novel geometry-contrastive Transformer that efficiently perceives global geometric inconsistencies in 3D structures.
We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z)
- PatchFormer: A Versatile 3D Transformer Based on Patch Attention [0.358439716487063]
We introduce patch-attention to adaptively learn a much smaller set of bases upon which the attention maps are computed.
By a weighted summation over these bases, patch-attention not only captures the global shape context but also achieves linear complexity in the input size.
Our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers (a rough sketch of patch-attention appears below).
arXiv Detail & Related papers (2021-10-30T08:39:55Z)
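The following sketch illustrates the patch-attention idea: points are softly aggregated into a small set of M learned bases, and attention is computed against those bases instead of all N points, giving O(N*M) rather than O(N^2) cost. The names and shapes here are assumptions for illustration, not PatchFormer's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttention(nn.Module):
    """Attend to a small set of learned bases instead of all input points."""

    def __init__(self, dim, num_bases=32):
        super().__init__()
        self.to_assign = nn.Linear(dim, num_bases)  # soft point-to-base assignment
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (N, C) point/voxel features
        assign = F.softmax(self.to_assign(x), dim=0)   # (N, M), one column per base
        bases = assign.t() @ x                         # (M, C) aggregated bases
        q = self.to_q(x)                               # (N, C)
        k, v = self.to_k(bases), self.to_v(bases)      # (M, C) each
        attn = F.softmax(q @ k.t() / x.shape[1] ** 0.5, dim=-1)  # (N, M)
        return attn @ v                                # (N, C), O(N*M) total cost
```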
- The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization [8.424405898986118]
We propose two modifications to the Transformer architecture: copy gate and geometric attention (the copy gate is sketched below).
Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task.
NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing.
arXiv Detail & Related papers (2021-10-14T21:24:27Z)
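A small sketch of what a copy gate could look like: a sigmoid gate decides, per position, whether a layer emits its transformed output or simply copies its input forward, which NDR uses for adaptive routing. The module name and gating form are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Let a layer copy its input unchanged when the gate is closed."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x, transformed):
        # x: (..., D) layer input; transformed: (..., D) candidate update
        g = torch.sigmoid(self.gate(x))     # (..., 1); near 0 => copy x through
        return g * transformed + (1 - g) * x
```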
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes [86.2129580231191]
Adjoint Rigid Transform (ART) Network is a neural module which can be integrated with a variety of 3D networks.
ART learns to rotate input shapes to a learned canonical orientation, which is crucial for many tasks (a rough sketch appears below).
We will release our code and pre-trained models for further research.
arXiv Detail & Related papers (2021-02-01T20:58:45Z)
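As a hypothetical sketch of an ART-style module, the code below regresses a 3x3 matrix from the input point cloud and projects it onto the nearest rotation via SVD before applying it. The encoder and the SVD orthogonalization are my assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AdjointRigidTransform(nn.Module):
    """Regress a rotation and apply it to canonicalize an input shape."""

    def __init__(self, dim=128):
        super().__init__()
        # Tiny PointNet-like encoder regressing a raw 3x3 matrix per cloud.
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 9))

    def forward(self, pts):
        # pts: (B, N, 3) input point clouds
        raw = self.encoder(pts).max(dim=1).values.view(-1, 3, 3)  # (B, 3, 3)
        # Project onto the nearest rotation via SVD (with reflection fix).
        u, _, vT = torch.linalg.svd(raw)
        det = torch.linalg.det(u @ vT)
        fix = torch.diag_embed(torch.stack(
            [torch.ones_like(det), torch.ones_like(det), det], dim=-1))
        rot = u @ fix @ vT                              # (B, 3, 3) proper rotations
        return pts @ rot.transpose(1, 2)                # canonically oriented points
```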
- Gram Regularization for Multi-view 3D Shape Retrieval [3.655021726150368]
We propose a novel regularization term called Gram regularization.
By forcing the variance between weight kernels to be large, the regularizer helps extract discriminative features.
The proposed Gram regularization is data independent and converges stably and quickly without bells and whistles (a sketch appears below).
arXiv Detail & Related papers (2020-11-16T05:37:24Z)
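Below is a minimal sketch of a Gram-style regularizer consistent with this description: kernels are flattened and normalized, and the off-diagonal entries of their Gram (cosine-similarity) matrix are penalized so kernels stay diverse. This is one plausible reading of the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gram_regularizer(weight):
    """Penalize pairwise similarity between conv kernels to keep them diverse.

    weight: (out_channels, in_channels, kH, kW) convolution kernels
    """
    w = F.normalize(weight.flatten(1), dim=1)      # (O, rest), unit-norm kernels
    gram = w @ w.t()                               # (O, O) cosine similarities
    off_diag = gram - torch.diag(torch.diag(gram)) # keep only cross-kernel terms
    return off_diag.pow(2).sum()                   # small when kernels differ
```

Usage would be something like `total_loss = task_loss + lam * gram_regularizer(conv.weight)`, where `lam` is a hypothetical weighting hyperparameter.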