Long-Range Grouping Transformer for Multi-View 3D Reconstruction
- URL: http://arxiv.org/abs/2308.08724v1
- Date: Thu, 17 Aug 2023 01:34:59 GMT
- Title: Long-Range Grouping Transformer for Multi-View 3D Reconstruction
- Authors: Liying Yang, Zhenwei Zhu, Xuxin Lin, Jian Nong, Yanyan Liang
- Abstract summary: Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
- Score: 9.2709012704338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, transformer networks have demonstrated superior performance
in many computer vision tasks. In multi-view 3D reconstruction algorithms
following this paradigm, self-attention has to process intricate image tokens
that carry a massive amount of information when a large number of views is
given as input. This curse of information content makes model learning
extremely difficult. To alleviate the problem, recent methods compress the
number of tokens representing each view or discard the attention operations
between tokens from different views; both choices degrade performance.
Therefore, we propose long-range grouping attention (LGA) based on the
divide-and-conquer principle. Tokens from all views are grouped for separate
attention operations. The tokens in each group are sampled from all views and
provide a macro representation of the view they reside in, while the diversity
among different groups guarantees the richness of feature learning. An
effective and efficient encoder can thus be established that connects
inter-view features using LGA and extracts intra-view features using standard
self-attention layers. Moreover, a novel progressive upsampling decoder is
designed to generate voxels at relatively high resolution. Building on the
above, we construct a powerful transformer-based network called LRGT.
Experimental results on ShapeNet verify that our method achieves SOTA accuracy
in multi-view reconstruction. Code will be available at
https://github.com/LiyingCV/Long-Range-Grouping-Transformer.
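The grouping scheme described in the abstract lends itself to a short sketch. The snippet below is a minimal, hedged illustration of long-range grouping attention: tokens from all views are split into groups by strided sampling, so every group contains tokens from every view, and standard self-attention runs inside each group. The class name, the strided sampling rule, and the tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of long-range grouping attention (LGA): each group gathers a
# strided subset of tokens from every view and runs ordinary self-attention
# within the group, connecting inter-view features at reduced cost. The
# sampling rule below is an assumption for illustration.
import torch
import torch.nn as nn


class LongRangeGroupingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_groups=4):
        super().__init__()
        self.num_groups = num_groups
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, N, C) -- batch, views, tokens per view, channels.
        B, V, N, C = x.shape
        G = self.num_groups
        assert N % G == 0, "tokens per view must be divisible by num_groups"
        out = torch.empty_like(x)
        for g in range(G):
            # Group g takes the tokens at positions g, g+G, g+2G, ... of every
            # view, so attention inside the group mixes all views.
            idx = torch.arange(g, N, G, device=x.device)
            group = x[:, :, idx, :].reshape(B, V * idx.numel(), C)
            attended, _ = self.attn(group, group, group)
            out[:, :, idx, :] = attended.reshape(B, V, idx.numel(), C)
        return out


# Example: 2 samples, 8 views, 196 tokens per view, 384-dim features.
tokens = torch.randn(2, 8, 196, 384)
lga = LongRangeGroupingAttention(dim=384, num_heads=8, num_groups=4)
print(lga(tokens).shape)  # torch.Size([2, 8, 196, 384])
```

In the encoder described in the abstract, a layer like this would alternate with standard self-attention applied within each view, so inter-view and intra-view features are handled by separate, cheaper attention operations.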
Related papers
- Efficient Point Transformer with Dynamic Token Aggregating for Point Cloud Processing [19.73918716354272]
We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
It achieves SOTA performance and runs up to 30× faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and specially designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation [25.689520892609213]
We present GPViT, a novel non-hierarchical (i.e., non-pyramidal) transformer model for general visual recognition with high-resolution features.
We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation.
arXiv Detail & Related papers (2022-12-13T18:26:00Z) - Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z) - MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images (a sketch of this channel-wise attention follows at the end of this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
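As noted in the XCiT entry above, cross-covariance attention replaces token-to-token attention with channel-to-channel attention, so the attention map is d × d per head and the cost grows linearly with the number of tokens. The sketch below illustrates that idea; the class name and layer details (e.g., the learnable temperature) follow the paper's description but are written here as an illustrative assumption, not the reference code.

```python
# Minimal sketch of cross-covariance attention (XCA): queries and keys are
# L2-normalised along the token axis and attention is computed between
# feature channels, giving a (d_head x d_head) attention map per head and
# linear complexity in the number of tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature, as described for XCA (assumed form).
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):
        # x: (B, N, C) -- tokens are rows, channels are columns.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)   # each: (B, heads, d_head, N)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d)
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B, heads, d_head, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


# Example: 4096 tokens from a high-resolution image, 256-dim features.
x = torch.randn(1, 4096, 256)
print(CrossCovarianceAttention(dim=256)(x).shape)  # torch.Size([1, 4096, 256])
```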
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.