MMRDN: Consistent Representation for Multi-View Manipulation
Relationship Detection in Object-Stacked Scenes
- URL: http://arxiv.org/abs/2304.12592v1
- Date: Tue, 25 Apr 2023 05:55:29 GMT
- Title: MMRDN: Consistent Representation for Multi-View Manipulation
Relationship Detection in Object-Stacked Scenes
- Authors: Han Wang, Jiayuan Zhang, Lipeng Wan, Xingyu Chen, Xuguang Lan, Nanning
Zheng
- Abstract summary: We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
- Score: 62.20046129613934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manipulation relationship detection (MRD) aims to guide the robot to grasp
objects in the right order, which is important to ensure the safety and
reliability of grasping in object-stacked scenes. Previous works infer the
manipulation relationship with a deep neural network trained on data collected
from a single predefined view, which makes them susceptible to visual
dislocation in unstructured environments. Multi-view data provide more
comprehensive spatial information, but multi-view MRD faces the challenge of
domain shift. In this paper, we propose a novel multi-view fusion framework,
namely the multi-view MRD network (MMRDN), which is trained on 2D and 3D
multi-view data. We project
the 2D data from different views into a common hidden space and fit the
embeddings with a set of Von-Mises-Fisher distributions to learn the consistent
representations. In addition, taking advantage of the position information within the
3D data, we select a set of $K$ Maximum Vertical Neighbors (KMVN) points from
the point cloud of each object pair, which encodes the relative position of
these two objects. Finally, the features of multi-view 2D and 3D data are
concatenated to predict the pairwise relationship of objects. Experimental
results on the challenging REGRAD dataset show that MMRDN outperforms the
state-of-the-art methods in multi-view MRD tasks. The results also demonstrate
that our model, trained on synthetic data, is capable of transferring to
real-world scenarios.
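The two technical ingredients described above can be sketched concretely. Below is a minimal, illustrative take on the consistent-representation idea: per-view 2D features are projected into a shared latent space, L2-normalized onto the unit hypersphere, and pulled toward a common mean direction with a von Mises-Fisher-style log-likelihood. The module, the dimensions, and the concentration value kappa are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a vMF-based multi-view consistency objective (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewProjector(nn.Module):
    """Projects per-view 2D features into a shared latent space on the unit sphere."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so embeddings lie on the hypersphere, as vMF assumes.
        return F.normalize(self.proj(x), dim=-1)

def vmf_consistency_loss(view_embeddings: torch.Tensor, kappa: float = 10.0) -> torch.Tensor:
    """view_embeddings: (num_views, batch, dim), unit-normalized.
    Treats the per-sample mean direction as the vMF mean and maximizes the
    normalization-free log-likelihood kappa * mu^T x for every view."""
    mu = F.normalize(view_embeddings.mean(dim=0), dim=-1)          # (batch, dim)
    log_lik = kappa * (view_embeddings * mu.unsqueeze(0)).sum(-1)  # (num_views, batch)
    return -log_lik.mean()
```

The KMVN selection can likewise be sketched under one plausible reading of the abstract: from the point cloud of each object in a pair, keep the $K$ points that are extremal along the vertical axis, so the retained points emphasize how the two objects are stacked. The exact selection rule, the value of k, and the axis convention are assumptions.

```python
# Hedged sketch of K Maximum Vertical Neighbors (KMVN) selection (one possible reading).
import numpy as np

def kmvn_points(cloud_a: np.ndarray, cloud_b: np.ndarray, k: int = 64) -> np.ndarray:
    """cloud_a, cloud_b: (N, 3) point clouds of the two objects, z assumed vertical.
    Returns up to 2k points intended to encode the pair's relative vertical layout."""
    top_a = cloud_a[np.argsort(cloud_a[:, 2])[-k:]]  # k highest points of object A
    top_b = cloud_b[np.argsort(cloud_b[:, 2])[-k:]]  # k highest points of object B
    return np.concatenate([top_a, top_b], axis=0)
```

In MMRDN, the multi-view 2D features and the 3D pair features are then concatenated and passed to a classifier that predicts the pairwise manipulation relationship.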
Related papers
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - Multimodal 3D Object Detection on Unseen Domains [37.142470149311904]
Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem.
We propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection.
We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
arXiv Detail & Related papers (2024-04-17T21:47:45Z) - PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information of RGB images and LiDAR point clouds at the points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D
Point Cloud Understanding [4.220064723125481]
Multi-view 2D information can provide superior self-supervised signals for 3D objects.
MM-Point is driven by intra-modal and inter-modal similarity objectives.
It achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN.
arXiv Detail & Related papers (2024-02-15T15:10:17Z) - SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images
for Articulated Objects [24.737865259695006]
We propose a self-supervised interaction perception method, referred to as SM$^3$, to model articulated objects.
By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters.
Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.
arXiv Detail & Related papers (2024-01-17T11:15:09Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor
Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - MVM3Det: A Novel Method for Multi-view Monocular 3D Detection [0.0]
MVM3Det simultaneously estimates the 3D position and orientation of the object according to the multi-view monocular information.
We present a first dataset for multi-view 3D object detection named MVM3D.
arXiv Detail & Related papers (2021-09-22T01:31:00Z) - Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality
Collaboration [56.01625477187448]
We propose a MultiModality PAnoramic multi-object Tracking framework (MMPAT).
It takes both 2D panorama images and 3D point clouds as input and then infers target trajectories using the multimodality data.
We evaluate the proposed method on the JRDB dataset, where the MMPAT achieves the top performance in both the detection and tracking tasks.
arXiv Detail & Related papers (2021-05-31T03:16:38Z) - Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)