UMIFormer: Mining the Correlations between Similar Tokens for Multi-View
3D Reconstruction
- URL: http://arxiv.org/abs/2302.13987v2
- Date: Thu, 17 Aug 2023 12:34:41 GMT
- Title: UMIFormer: Mining the Correlations between Similar Tokens for Multi-View
3D Reconstruction
- Authors: Zhenwei Zhu, Liying Yang, Ning Li, Chaohao Jiang, Yanyan Liang
- Abstract summary: We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and purpose-designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
- Score: 9.874357856580447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, many video tasks have achieved breakthroughs by utilizing
the vision transformer and establishing spatial-temporal decoupling for feature
extraction. Although multi-view 3D reconstruction also takes multiple images as
input, it cannot directly inherit this success because the associations between
unstructured views are completely ambiguous: there is no usable prior
relationship analogous to the temporal coherence of a video.
To solve this problem, we propose a novel transformer network for Unstructured
Multiple Images (UMIFormer). It exploits transformer blocks for decoupled
intra-view encoding and purpose-designed blocks for token rectification that mine the
correlation between similar tokens from different views to achieve decoupled
inter-view encoding. Afterward, all tokens acquired from various branches are
compressed into a fixed-size compact representation while preserving rich
information for reconstruction by leveraging the similarities between tokens.
We empirically demonstrate on ShapeNet that our decoupled learning method is
adaptable to unstructured multiple images. The experiments also verify that our
model outperforms existing SOTA methods by a large
margin. Code will be available at https://github.com/GaryZhu1996/UMIFormer.
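To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch, not the authors' implementation: it rectifies each view's tokens by aggregating the most similar tokens from the other views (decoupled inter-view encoding), then compresses the tokens of all views into a fixed-size representation through similarity-weighted pooling. The cosine similarity measure, the top-k aggregation rule, the anchor choice, and all sizes are illustrative assumptions.

```python
# Minimal PyTorch sketch of the two operations described in the abstract.
# All hyperparameters (k, num_out) and the exact aggregation rules are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F


def rectify_tokens(tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Inter-view token rectification: blend each token with its k most
    similar tokens taken from the *other* views. tokens: (V, N, C), V >= 2."""
    V, N, C = tokens.shape
    flat = tokens.reshape(V * N, C)
    normed = F.normalize(flat, dim=-1)
    sim = normed @ normed.T                                  # (VN, VN) cosine similarity

    # Mask same-view pairs so correlations are mined across views only.
    view_id = torch.arange(V).repeat_interleave(N)
    sim = sim.masked_fill(view_id[:, None] == view_id[None, :], float("-inf"))

    topk_sim, topk_idx = sim.topk(k, dim=-1)                 # (VN, k)
    weights = topk_sim.softmax(dim=-1).unsqueeze(-1)         # (VN, k, 1)
    gathered = flat[topk_idx]                                # (VN, k, C)
    rectified = flat + (weights * gathered).sum(dim=1)       # residual blend
    return rectified.reshape(V, N, C)


def compress_tokens(tokens: torch.Tensor, num_out: int = 64) -> torch.Tensor:
    """Compress (V, N, C) tokens from all views into a fixed-size (num_out, C)
    representation via similarity-weighted pooling onto anchor tokens
    (here simply the first num_out tokens; a real model could learn them)."""
    flat = tokens.reshape(-1, tokens.shape[-1])              # (V*N, C)
    anchors = flat[:num_out]
    sim = F.normalize(anchors, dim=-1) @ F.normalize(flat, dim=-1).T
    return sim.softmax(dim=-1) @ flat                        # (num_out, C)


if __name__ == "__main__":
    views = torch.randn(8, 196, 384)        # e.g. 8 views of 14x14 ViT tokens
    compact = compress_tokens(rectify_tokens(views))
    print(compact.shape)                    # torch.Size([64, 384])
```

In the actual model these operations would sit inside transformer blocks with learned projections; the sketch only illustrates how token similarity can drive both cross-view rectification and compression to a fixed-size representation.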
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Long-Range Grouping Transformer for Multi-View 3D Reconstruction [9.2709012704338]
Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
arXiv Detail & Related papers (2023-08-17T01:34:59Z)
- Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer)
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z)
- SWAT: Spatial Structure Within and Among Tokens [53.525469741515884]
We argue that models can have significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing.
arXiv Detail & Related papers (2021-11-26T18:59:38Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
- LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction [45.16128577837725]
Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them.
We propose LegoFormer, a transformer-based model that unifies object reconstruction under a single framework and parametrizes the reconstructed occupancy grid by its decomposition factors.
arXiv Detail & Related papers (2021-06-23T00:15:08Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear, rather than quadratic, computational and memory complexity; a minimal illustrative sketch follows at the end of this list.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
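The linear-cost cross-attention summarized in the CrossViT entry above can be illustrated with the sketch below (a single-head simplification under assumed dimensions, not the CrossViT code): the class token of one branch acts as the only query over the other branch's patch tokens, so attention cost grows linearly with the number of patches rather than quadratically.

```python
# Single-head sketch of cross-branch attention where one branch's class token
# is the sole query over the other branch's patch tokens, giving linear cost
# in the number of patches. Dimensions and the residual form are assumptions.
import torch
import torch.nn as nn


class CrossBranchAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cls_a: torch.Tensor, patches_b: torch.Tensor) -> torch.Tensor:
        """cls_a: (B, 1, C) class token of branch A;
        patches_b: (B, M, C) patch tokens of branch B."""
        q = self.q(cls_a)                                   # (B, 1, C)
        k = self.k(patches_b)                               # (B, M, C)
        v = self.v(patches_b)                               # (B, M, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, 1, M): one query row
        return cls_a + attn.softmax(dim=-1) @ v             # fused class token (B, 1, C)


cls_small = torch.randn(2, 1, 192)       # class token of the small-patch branch
patches_large = torch.randn(2, 49, 192)  # large-patch tokens projected to the same dim
print(CrossBranchAttention(192)(cls_small, patches_large).shape)  # torch.Size([2, 1, 192])
```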