ViewFormer: View Set Attention for Multi-view 3D Shape Understanding
- URL: http://arxiv.org/abs/2305.00161v1
- Date: Sat, 29 Apr 2023 03:58:20 GMT
- Title: ViewFormer: View Set Attention for Multi-view 3D Shape Understanding
- Authors: Hongyu Sun, Yongcai Wang, Peng Wang, Xudong Cai, Deying Li
- Abstract summary: We present ViewFormer, a model for multi-view 3D shape recognition and retrieval.
With only 2 attention blocks and 4.8M learnable parameters, ViewFormer reaches 98.8% recognition accuracy on ModelNet40 for the first time.
On the challenging RGBD dataset, our method achieves 98.4% recognition accuracy, which is a 4.1% absolute improvement over the strongest baseline.
- Score: 7.39435265842079
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents ViewFormer, a simple yet effective model for multi-view
3D shape recognition and retrieval. We systematically investigate existing
methods for aggregating multi-view information and propose a novel "view set"
perspective, which minimizes the relational assumptions about the views and
frees up representation flexibility. We devise an adaptive attention model
to capture pairwise and higher-order correlations of the elements in the view
set. The learned multi-view correlations are aggregated into an expressive view
set descriptor for recognition and retrieval. Experiments show the proposed
method unleashes surprising capabilities across different tasks and datasets.
For instance, with only 2 attention blocks and 4.8M learnable parameters,
ViewFormer reaches 98.8% recognition accuracy on ModelNet40 for the first time,
exceeding the previous best method by 1.1%. On the challenging RGBD dataset, our
method achieves 98.4% recognition accuracy, which is a 4.1% absolute
improvement over the strongest baseline. ViewFormer also sets new records in
several evaluation dimensions of 3D shape retrieval defined on the SHREC'17
benchmark.
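The abstract's core idea, treating the views of a shape as an unordered set, letting attention capture pairwise view correlations, and pooling into a single set descriptor, can be illustrated with a minimal NumPy sketch. All names, sizes, and the single-head residual update below are illustrative assumptions, not the paper's actual architecture (which stacks 2 learned attention blocks):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def view_set_attention(views, Wq, Wk, Wv):
    """One attention pass over an unordered set of per-view features (V, D)."""
    q, k, v = views @ Wq, views @ Wk, views @ Wv
    corr = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (V, V) pairwise view correlations
    return views + corr @ v                         # residual update of each view

rng = np.random.default_rng(0)
D = 8
views = rng.normal(size=(12, D))                    # e.g. 12 rendered views of one shape
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# Pool the attended views into a single set descriptor for recognition/retrieval.
descriptor = view_set_attention(views, Wq, Wk, Wv).mean(axis=0)

# Because no view order is assumed, the descriptor is permutation-invariant:
perm = rng.permutation(len(views))
assert np.allclose(descriptor,
                   view_set_attention(views[perm], Wq, Wk, Wv).mean(axis=0))
```

The final assertion is the point of the "view set" perspective: shuffling the views permutes the intermediate features but leaves the pooled descriptor unchanged, so no camera ordering or circular arrangement of views needs to be assumed.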
Related papers
- VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding [9.048401253308123]
This paper investigates flexible organization and explicit correlation learning for multiple views.
We devise a nimble Transformer model, named VSFormer, to explicitly capture pairwise and higher-order correlations of all elements in the set.
It reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD.
arXiv Detail & Related papers (2024-09-14T01:48:54Z)
- Enhancing Person Re-Identification via Uncertainty Feature Fusion and Auto-weighted Measure Combination [1.183049138259841]
Person re-identification (Re-ID) is a challenging task that involves identifying the same person across different camera views in surveillance systems.
In this paper, a new approach is introduced that enhances the capability of Re-ID models through the Uncertain Feature Fusion Method (UFFM) and Auto-weighted Measure Combination (AMC).
Our method significantly improves Rank@1 accuracy and Mean Average Precision (mAP) when evaluated on person re-identification datasets.
arXiv Detail & Related papers (2024-05-02T09:09:48Z)
- OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding [53.21204584976076]
We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds.
We scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions.
We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition.
arXiv Detail & Related papers (2023-05-18T07:07:19Z)
- Cross-view Graph Contrastive Representation Learning on Partially Aligned Multi-view Data [52.491074276133325]
Multi-view representation learning has developed rapidly over the past decades and has been applied in many fields.
We propose a new cross-view graph contrastive learning framework, which integrates multi-view information to align data and learn latent representations.
Experiments conducted on several real datasets demonstrate the effectiveness of the proposed method on the clustering and classification tasks.
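The alignment idea in this summary, pulling representations of the same sample under different views together while pushing different samples apart, is commonly realized with an InfoNCE-style contrastive objective. The sketch below is that generic objective, not the paper's exact graph-contrastive loss; the function name and temperature are assumptions:

```python
import numpy as np

def info_nce(za, zb, temp=0.1):
    """Contrastive alignment loss: row i of za should match row i of zb."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temp                    # (N, N); diagonal = aligned pairs
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy toward the diagonal

rng = np.random.default_rng(1)
za = rng.normal(size=(16, 32))                   # view-1 embeddings of 16 samples
aligned = info_nce(za, za.copy())                # view-2 perfectly aligned
shuffled = info_nce(za, za[rng.permutation(16)]) # view-2 pairing broken
assert aligned < shuffled                        # alignment lowers the loss
```

The final assertion shows the behavior the framework relies on: the loss is minimized exactly when corresponding samples across views are embedded nearby, which is what makes the learned latent space usable for downstream clustering and classification.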
arXiv Detail & Related papers (2022-11-08T09:19:32Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [78.48081972698888]
We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transformers.
arXiv Detail & Related papers (2021-04-24T06:48:23Z)
- Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-04-16T10:20:26Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z) - Distribution Alignment: A Unified Framework for Long-tail Visual
Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weight method in the two-stage learning to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.