PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning
- URL: http://arxiv.org/abs/2510.20406v1
- Date: Thu, 23 Oct 2025 10:17:01 GMT
- Title: PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning
- Authors: Xiaogang Jia, Qian Wang, Anrui Wang, Han A. Wang, Balázs Gyenes, Emiliyan Gospodinov, Xinkai Jiang, Ge Li, Hongyi Zhou, Weiran Liao, Xi Huang, Maximilian Beck, Moritz Reuss, Rudolf Lioutikov, Gerhard Neumann
- Abstract summary: Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points. Our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception.
- Score: 35.5287060355186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Because the points are arranged in a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/
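The abstract describes a point map as a structured grid of 3D points that can be moved between reference frames. The abstract does not say how the grid is built, so the following is only a minimal sketch under stated assumptions: a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, a depth image given as nested lists, and a rigid transform given as a 3x3 rotation plus a translation. The function names `depth_to_point_map` and `transform_point_map` are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
PointMap = List[List[Point]]  # H x W grid, one XYZ point per pixel


def depth_to_point_map(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> PointMap:
    """Unproject a depth image into an H x W grid of camera-frame points.

    Assumption: pinhole model, depth[v][u] is the z distance at pixel (u, v).
    No downsampling is applied, so every pixel keeps its own 3D point.
    """
    point_map: PointMap = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


def transform_point_map(
    point_map: PointMap,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> PointMap:
    """Apply p' = R p + t to every point while keeping the grid layout."""

    def apply(p: Point) -> Point:
        return tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )

    return [[apply(p) for p in row] for row in point_map]


if __name__ == "__main__":
    # Toy 2 x 2 depth image with made-up intrinsics.
    pm = depth_to_point_map([[1.0, 1.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    world = transform_point_map(pm, identity, [0.0, 0.0, 1.0])
    print(pm[0][0], world[0][0])
```

Because the result keeps the image's H x W layout, it can be treated like a three-channel image alongside RGB, which is the property the abstract relies on when it mentions applying established computer vision techniques to 3D data. Real pipelines would typically use an array library rather than nested lists; plain Python is used here only to keep the sketch dependency-free.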
Related papers
- CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image [86.75098349480014]
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks. We propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. We introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring RGB images and depth noise simulated from real sensors.
arXiv Detail & Related papers (2025-04-15T14:30:26Z)
- Towards Fusing Point Cloud and Visual Representations for Imitation Learning [57.886331184389604]
We propose FPV-Net, a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning.
arXiv Detail & Related papers (2025-02-17T20:46:54Z)
- Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching [2.400446821380503]
We introduce an efficient framework to learn descriptors for both RGB images and point clouds.
It takes a visual state space model (VMamba) as the backbone and employs a pixel-view-scene joint training strategy.
A visible 3D points overlap strategy is then designed to quantify the similarity between point cloud views and RGB images for multi-view supervision.
arXiv Detail & Related papers (2024-10-08T18:31:41Z)
- PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training [90.06520673092702]
We present PointRegGPT, boosting 3D point cloud registration using generative point-cloud pairs for training.
To our knowledge, this is the first generative approach that explores realistic data generation for indoor point cloud registration.
arXiv Detail & Related papers (2024-07-19T06:29:57Z)
- ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection [10.162882793554191]
3D articulated object manipulation is essential for enabling robots to interact with their environment.
Many existing studies make use of 3D point clouds as the primary input for manipulation policies.
RGB images offer high-resolution observations using cost-effective devices but lack spatial 3D geometric information.
This framework is designed to capture multiple perspectives of the target object and infer depth information to complement its geometry.
arXiv Detail & Related papers (2023-10-13T12:42:54Z)
- Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast [9.14535402695962]
Geometry and color information provided by point clouds are crucial for 3D scene understanding.
We propose a universal 3D scene pre-training framework via Geometry-Color Contrast (Point-GCC).
Point-GCC aligns geometry and color information using a Siamese network.
arXiv Detail & Related papers (2023-05-31T07:44:03Z)
- Neural Implicit Dense Semantic SLAM [83.04331351572277]
We propose a novel RGBD vSLAM algorithm that learns a memory-efficient, dense 3D geometry, and semantic segmentation of an indoor scene in an online manner.
Our pipeline combines classical 3D vision-based tracking and loop closing with neural fields-based mapping.
Our proposed algorithm can greatly enhance scene perception and assist with a range of robot control problems.
arXiv Detail & Related papers (2023-04-27T23:03:52Z)
- Flattening-Net: Deep Regular 2D Representation for 3D Point Cloud Analysis [66.49788145564004]
We present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology.
Our methods perform favorably against the current state-of-the-art competitors.
arXiv Detail & Related papers (2022-12-17T15:05:25Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- Object-Augmented RGB-D SLAM for Wide-Disparity Relocalisation [3.888848425698769]
We propose a novel object-augmented RGB-D SLAM system that is capable of constructing a consistent object map and performing relocalisation based on centroids of objects in the map.
arXiv Detail & Related papers (2021-08-05T11:02:25Z)
- LCD -- Line Clustering and Description for Place Recognition [29.053923938306323]
We introduce a novel learning-based approach to place recognition, using RGB-D cameras and line clusters as visual and geometric features.
We present a neural network architecture based on the attention mechanism for frame-wise line clustering.
A similar neural network is used for the description of these clusters with a compact embedding of 128 floating point numbers.
arXiv Detail & Related papers (2020-10-21T09:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.