Long-Range Grouping Transformer for Multi-View 3D Reconstruction
- URL: http://arxiv.org/abs/2308.08724v1
- Date: Thu, 17 Aug 2023 01:34:59 GMT
- Title: Long-Range Grouping Transformer for Multi-View 3D Reconstruction
- Authors: Liying Yang, Zhenwei Zhu, Xuxin Lin, Jian Nong, Yanyan Liang
- Abstract summary: Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
- Score: 9.2709012704338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, transformer networks have demonstrated superior performance
in many computer vision tasks. In multi-view 3D reconstruction algorithms
following this paradigm, self-attention has to process intricate image tokens
that carry a massive amount of information when a large number of views is
given as input. This curse of information content makes model learning
extremely difficult. To alleviate the problem, recent methods compress the
number of tokens representing each view or discard the attention operations
between tokens from different views; both choices degrade performance.
Therefore, we propose long-range grouping attention (LGA) based on the
divide-and-conquer principle. Tokens from all views are grouped for separate
attention operations. The tokens in each group are sampled from all views and
provide a macro representation of the view they reside in, while the diversity
among different groups guarantees the richness of feature learning. An
effective and efficient encoder can thus be established that connects
inter-view features using LGA and extracts intra-view features using standard
self-attention layers. Moreover, a novel progressive upsampling decoder is
designed to generate voxels at relatively high resolution. Building on the
above, we construct a powerful transformer-based network called LRGT.
Experimental results on ShapeNet verify that our method achieves SOTA accuracy
in multi-view reconstruction. Code will be available at
https://github.com/LiyingCV/Long-Range-Grouping-Transformer.
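The grouping scheme described in the abstract lends itself to a short sketch. The snippet below is a minimal, hedged illustration of long-range grouping attention: tokens from all views are split into groups by strided sampling, so every group contains tokens from every view, and standard self-attention runs inside each group. The class name, the strided sampling rule, and the tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of long-range grouping attention (LGA): each group gathers a
# strided subset of tokens from every view and runs ordinary self-attention
# within the group, connecting inter-view features at reduced cost. The
# sampling rule below is an assumption for illustration.
import torch
import torch.nn as nn


class LongRangeGroupingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_groups=4):
        super().__init__()
        self.num_groups = num_groups
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, N, C) -- batch, views, tokens per view, channels.
        B, V, N, C = x.shape
        G = self.num_groups
        assert N % G == 0, "tokens per view must be divisible by num_groups"
        out = torch.empty_like(x)
        for g in range(G):
            # Group g takes the tokens at positions g, g+G, g+2G, ... of every
            # view, so attention inside the group mixes all views.
            idx = torch.arange(g, N, G, device=x.device)
            group = x[:, :, idx, :].reshape(B, V * idx.numel(), C)
            attended, _ = self.attn(group, group, group)
            out[:, :, idx, :] = attended.reshape(B, V, idx.numel(), C)
        return out


# Example: 2 samples, 8 views, 196 tokens per view, 384-dim features.
tokens = torch.randn(2, 8, 196, 384)
lga = LongRangeGroupingAttention(dim=384, num_heads=8, num_groups=4)
print(lga(tokens).shape)  # torch.Size([2, 8, 196, 384])
```

In the encoder described in the abstract, a layer like this would alternate with standard self-attention applied within each view, so inter-view and intra-view features are handled by separate, cheaper attention operations.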
Related papers
- Efficient Point Transformer with Dynamic Token Aggregating for Point Cloud Processing [19.73918716354272]
We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
It achieves SOTA performance and runs up to 30× faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and specially designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation [25.689520892609213]
We present GPViT, a novel non-hierarchical (i.e., non-pyramidal) transformer model for general visual recognition with high-resolution features.
We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation.
arXiv Detail & Related papers (2022-12-13T18:26:00Z) - Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z) - MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images (a sketch of this channel-wise attention follows at the end of this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
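As noted in the XCiT entry above, cross-covariance attention replaces token-to-token attention with channel-to-channel attention, so the attention map is d × d per head and the cost grows linearly with the number of tokens. The sketch below illustrates that idea; the class name and layer details (e.g., the learnable temperature) follow the paper's description but are written here as an illustrative assumption, not the reference code.

```python
# Minimal sketch of cross-covariance attention (XCA): queries and keys are
# L2-normalised along the token axis and attention is computed between
# feature channels, giving a (d_head x d_head) attention map per head and
# linear complexity in the number of tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature, as described for XCA (assumed form).
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):
        # x: (B, N, C) -- tokens are rows, channels are columns.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)   # each: (B, heads, d_head, N)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d)
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B, heads, d_head, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


# Example: 4096 tokens from a high-resolution image, 256-dim features.
x = torch.randn(1, 4096, 256)
print(CrossCovarianceAttention(dim=256)(x).shape)  # torch.Size([1, 4096, 256])
```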
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.