VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding
- URL: http://arxiv.org/abs/2409.09254v1
- Date: Sat, 14 Sep 2024 01:48:54 GMT
- Title: VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding
- Authors: Hongyu Sun, Yongcai Wang, Peng Wang, Haoran Deng, Xudong Cai, Deying Li
- Abstract summary: This paper investigates flexible organization and explicit correlation learning for multiple views.
We devise a nimble Transformer model, named \emph{VSFormer}, to explicitly capture pairwise and higher-order correlations of all elements in the set.
It reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD.
- Score: 9.048401253308123
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits the flexibility of exploring inter-view correlations and the effectiveness of target tasks. To overcome the above problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as \emph{View Set}, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named \emph{VSFormer}, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, higher inference efficiency and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at \url{https://github.com/auniquesun/VSFormer}.
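To make the set-attention idea concrete, here is a minimal PyTorch sketch, not the authors' released implementation: the module name, feature sizes and pooling choice are illustrative assumptions. A Transformer encoder without positional encodings is applied to the bag of per-view features, so its N x N attention matrix scores exactly the Cartesian product of the view set with itself, and symmetric mean pooling keeps the classifier permutation-invariant.

```python
# Minimal sketch of attention over a permutation-invariant view set.
# Illustrative names and sizes -- not the code from the VSFormer repo.
import torch
import torch.nn as nn

class ViewSetEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2, num_classes=40):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, views):        # views: (B, N, dim), one row per view
        # No positional encoding: the N x N attention matrix scores every
        # (view_i, view_j) pair, i.e. the Cartesian product of the set.
        fused = self.encoder(views)
        pooled = fused.mean(dim=1)   # symmetric pooling => permutation invariance
        return self.head(pooled)

# Example: a batch of 4 shapes, each rendered from 12 views and embedded
# (e.g. by a shared CNN backbone) into 512-d features.
logits = ViewSetEncoder()(torch.randn(4, 12, 512))
print(logits.shape)                  # torch.Size([4, 40]), ModelNet40 classes
```

Because nothing in this encoder depends on view order or count, any number and arrangement of views can be fed in, which is the flexibility the abstract argues for.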
Related papers
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates \textit{multi-view encoding}, \textit{multi-view matching}, and \textit{multi-view fusion} to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
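As a side note on the matching step in the M$^3$Net entry above, here is a hedged, generic prototype-matching sketch in the spirit of matching-based few-shot recognition; it is not M$^3$Net's multi-view pipeline, and every name in it is an illustrative assumption.

```python
# Generic few-shot similarity matching: compare query embeddings to class
# prototypes averaged from the support set. Illustrative only.
import torch
import torch.nn.functional as F

def match_query_to_support(query, support, support_labels, n_way):
    """query: (Q, D), support: (S, D), support_labels: (S,) in [0, n_way)."""
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(n_way)])            # (n_way, D)
    sims = F.cosine_similarity(query.unsqueeze(1),           # (Q, n_way)
                               protos.unsqueeze(0), dim=-1)
    return sims.argmax(dim=1)                                # class per query

# 5-way 5-shot episode with 10 queries and 256-d embeddings.
preds = match_query_to_support(
    torch.randn(10, 256), torch.randn(25, 256),
    torch.arange(5).repeat_interleave(5), n_way=5)
```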
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network \textbf{(MIMIC)} framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation [30.79358827005448]
Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images.
Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets.
We propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.
arXiv Detail & Related papers (2023-06-23T10:17:56Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
Among its contributions, a consensus similarity graph shared by different views is learned during the feature selection procedure.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
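To illustrate the message-passing primitive behind the graph model in the entry above, here is a generic single layer; the layer name, GRU update and dense adjacency are assumptions of this sketch, not the paper's design.

```python
# One generic message-passing step over a dense adjacency matrix:
# aggregate transformed neighbor features, then update each node state.
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # transform features into messages
        self.upd = nn.GRUCell(dim, dim)  # fold aggregated messages into state

    def forward(self, x, adj):           # x: (N, dim) nodes, adj: (N, N) edges
        m = adj @ self.msg(x)            # sum incoming messages from neighbors
        return self.upd(m, x)            # updated node states, shape (N, dim)

# Example: 6 entities in a scene with random connectivity.
x = torch.randn(6, 128)
adj = (torch.rand(6, 6) > 0.5).float()
out = MessagePassingLayer()(x, adj)
```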
- Disentangled Graph Collaborative Filtering [100.26835145396782]
Disentangled Graph Collaborative Filtering (DGCF) is a new model for learning informative representations of users and items from interaction data.
By modeling a distribution over intents for each user-item interaction, we iteratively refine the intent-aware interaction graphs and representations.
DGCF achieves significant improvements over several state-of-the-art models like NGCF, DisenGCN, and MacridVAE.
arXiv Detail & Related papers (2020-07-03T15:37:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.