SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and
Multi-View for 3D Object Retrieval
- URL: http://arxiv.org/abs/2307.10601v2
- Date: Thu, 30 Nov 2023 02:24:29 GMT
- Title: SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and
Multi-View for 3D Object Retrieval
- Authors: Dongyun Lin, Yi Cheng, Aiyuan Guo, Shangbo Mao, Yiqun Li
- Abstract summary: Multi-modality 3D object retrieval is rarely developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
- Score: 8.74845857766369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To address 3D object retrieval, substantial efforts have been made to
generate highly discriminative descriptors of 3D objects represented by a
single modality, e.g., voxels, point clouds or multi-view images. It is
promising to leverage the complementary information from multi-modality
representations of 3D objects to further improve retrieval performance.
However, multi-modality 3D object retrieval is rarely developed and analyzed on
large-scale datasets. In this paper, we propose self-and-cross attention based
aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object
retrieval. With deep features extracted from point clouds and multi-view
images, we design two types of feature aggregation modules, namely the
In-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module
(CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism
to aggregate multi-view features, while CMAM exploits a cross-attention
mechanism that lets point cloud features interact with multi-view features. The
final descriptor of a 3D object for retrieval is obtained by concatenating the
aggregated features from both modules. Extensive experiments
and analysis are conducted on three datasets, ranging from small to large
scale, to show the superiority of the proposed SCA-PVNet over the
state-of-the-art methods.
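The abstract describes a two-branch aggregation scheme: self-attention within the multi-view features (IMAM), cross-attention from point cloud features to multi-view features (CMAM), and concatenation of the two aggregated outputs into the retrieval descriptor. Below is a minimal PyTorch sketch of that scheme. The module names follow the abstract, but the feature dimensions, backbone outputs, pooling choices, and attention layout are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the SCA-PVNet aggregation idea (assumed, not the authors' code).
import torch
import torch.nn as nn

class IMAM(nn.Module):
    """In-Modality Aggregation Module: self-attention over multi-view features."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, V, D) -- one D-dim feature per rendered view
        out, _ = self.attn(view_feats, view_feats, view_feats)
        return out.mean(dim=1)  # pool views into a single (B, D) feature

class CMAM(nn.Module):
    """Cross-Modality Aggregation Module: point features attend to view features."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_feats: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N, D) as queries; view_feats: (B, V, D) as keys/values
        out, _ = self.attn(point_feats, view_feats, view_feats)
        return out.mean(dim=1)  # (B, D)

class SCAPVNetHead(nn.Module):
    """Concatenates the two aggregated features into the final retrieval descriptor."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.imam = IMAM(dim)
        self.cmam = CMAM(dim)

    def forward(self, point_feats: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.imam(view_feats),
                           self.cmam(point_feats, view_feats)], dim=-1)
        return nn.functional.normalize(fused, dim=-1)  # (B, 2D) descriptor

# Usage: rank gallery objects by descriptor similarity.
head = SCAPVNetHead(dim=512)
points = torch.randn(2, 1024, 512)  # assumed point-cloud backbone output
views = torch.randn(2, 12, 512)     # assumed multi-view backbone output
desc = head(points, views)
scores = desc @ desc.T              # pairwise cosine similarities for retrieval
```

Since the descriptors are L2-normalized, retrieval reduces to a nearest-neighbor search under cosine similarity; the paper does not specify this head layout, so treat the pooling and normalization choices above as placeholders.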
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called the Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection [28.319440934322728]
MV2DFusion is a multi-modal detection framework that integrates the strengths of both worlds through an advanced query-based fusion mechanism.
Our framework's flexibility allows it to integrate with any image and point cloud-based detectors, showcasing its adaptability and potential for future advancements.
arXiv Detail & Related papers (2024-08-12T06:46:05Z) - PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information of RGB images and LiDAR point clouds at the points of interest (PoIs)
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection and fusion at the PoIs.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images
for Articulated Objects [24.737865259695006]
We propose a self-supervised interaction perception method, referred to as SM$^3$, to model articulated objects.
By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters.
Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.
arXiv Detail & Related papers (2024-01-17T11:15:09Z) - Multi-Projection Fusion and Refinement Network for Salient Object
Detection in 360° Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect salient objects in 360° omnidirectional images.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z) - M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object
Detection with Transformers [78.48081972698888]
We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations and feature scales while simultaneously modeling mutual relationships between point clouds using transformers.
arXiv Detail & Related papers (2021-04-24T06:48:23Z) - 3D-MAN: 3D Multi-frame Attention Network for Object Detection [22.291051951077485]
3D-MAN is a 3D multi-frame attention network that effectively aggregates features from multiple perspectives.
We show that 3D-MAN achieves state-of-the-art results compared to published single-frame and multi-frame methods.
arXiv Detail & Related papers (2021-03-30T03:44:22Z)