SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
- URL: http://arxiv.org/abs/2510.10602v1
- Date: Sun, 12 Oct 2025 13:36:40 GMT
- Title: SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
- Authors: Zhuoheng Gao, Jiyao Zhang, Zhiyong Xie, Hao Dong, Zhaofei Yu, Rongmei Chen, Guozhang Chen, Tiejun Huang
- Abstract summary: Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud.
- Score: 57.84331423686738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.
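The pipeline the abstract describes — fusing left/right spike streams and iteratively refining a grasp hypothesis with a recurrent spiking network — can be illustrated with a minimal sketch. Everything below (the LIF dynamics, fusion by concatenation, a 7-D translation-plus-quaternion grasp parameterization, all layer sizes) is an assumption for illustration, not the authors' implementation:

```python
import numpy as np

# Minimal sketch of the SpikeGrasp idea (not the authors' code): fuse binary
# stereo spike frames and iteratively refine a grasp hypothesis with a
# recurrent leaky integrate-and-fire (LIF) layer. All sizes and the 7-D grasp
# parameterization (xyz translation + quaternion) are illustrative assumptions.

rng = np.random.default_rng(0)

N_IN = 2 * 32 * 32   # concatenated left + right spike frames (32x32 each)
N_HID = 256          # recurrent LIF neurons
N_OUT = 7            # grasp pose: 3 translation + 4 quaternion components

W_in = rng.normal(0, 0.05, (N_HID, N_IN))    # input -> hidden
W_rec = rng.normal(0, 0.05, (N_HID, N_HID))  # hidden -> hidden (recurrent)
W_out = rng.normal(0, 0.05, (N_OUT, N_HID))  # hidden -> grasp update

def lif_step(v, spikes_in, spikes_hid, tau=0.9, v_th=1.0):
    """One LIF update: leak, integrate input and recurrent spikes, fire, reset."""
    v = tau * v + W_in @ spikes_in + W_rec @ spikes_hid
    fired = (v >= v_th).astype(np.float32)
    v = v * (1.0 - fired)  # hard reset of neurons that fired
    return v, fired

def refine_grasp(left_stream, right_stream, n_steps):
    """Iteratively refine a grasp hypothesis from stereo spike frames."""
    v = np.zeros(N_HID)
    spikes_hid = np.zeros(N_HID)
    grasp = np.array([0, 0, 0, 1, 0, 0, 0], dtype=np.float32)  # identity quaternion
    for t in range(n_steps):
        fused = np.concatenate([left_stream[t].ravel(), right_stream[t].ravel()])
        v, spikes_hid = lif_step(v, fused, spikes_hid)
        grasp = grasp + 0.1 * (W_out @ spikes_hid)      # small residual update
        grasp[3:] /= np.linalg.norm(grasp[3:]) + 1e-8   # renormalize quaternion
    return grasp

# Toy usage: random binary spike frames standing in for spike-camera output.
T = 50
left = (rng.random((T, 32, 32)) < 0.05).astype(np.float32)
right = (rng.random((T, 32, 32)) < 0.05).astype(np.float32)
print(refine_grasp(left, right, T))
```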
Related papers
- SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams [70.9610707466343]
Bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. Existing methods lack specialized stereo algorithms and benchmarks tailored to spike data. We propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams.
arXiv Detail & Related papers (2025-05-26T04:14:34Z)
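A common classical baseline for stereo from spike data — assumed here for illustration, and not SpikeStereoNet's learned architecture — is to collapse each binary spike stream into a firing-rate map and score candidate disparities by block matching:

```python
import numpy as np

# Hedged sketch: estimate per-pixel disparity from stereo spike streams by
# (1) collapsing each binary spike stream into a firing-rate map and
# (2) scoring candidate disparities with a sum-of-absolute-differences cost.
# This is a baseline illustration, not SpikeStereoNet's brain-inspired model.

def firing_rate(spikes):
    """spikes: (T, H, W) binary array -> (H, W) mean firing rate."""
    return spikes.mean(axis=0)

def disparity_map(left_spikes, right_spikes, max_disp=16, patch=3):
    rl, rr = firing_rate(left_spikes), firing_rate(right_spikes)
    H, W = rl.shape
    pad = patch // 2
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(pad, H - pad):
        for x in range(pad + max_disp, W - pad):
            ref = rl[y - pad:y + pad + 1, x - pad:x + pad + 1]
            costs = [np.abs(ref - rr[y - pad:y + pad + 1,
                                     x - d - pad:x - d + pad + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))  # lowest-cost disparity wins
    return disp

# Toy usage with random spike tensors of shape (T, H, W).
rng = np.random.default_rng(1)
left = (rng.random((64, 32, 48)) < 0.1).astype(np.float32)
right = (rng.random((64, 32, 48)) < 0.1).astype(np.float32)
print(disparity_map(left, right)[16])  # one scanline of the toy disparity map
```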
- D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as smartphone captures. Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
arXiv Detail & Related papers (2024-06-14T14:35:44Z)
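The hash-encoded, time-conditioned feature lookup the D-NPC summary describes can be approximated in a few lines. The single-level table, the specific hash function, and time conditioning by simple concatenation are simplifying assumptions; per the summary, the paper keeps geometry and appearance in separate grids and uses multi-level encodings in practice:

```python
import numpy as np

# Toy single-level hash-encoded feature grid (instant-NGP-style spatial hash).
# D-NPC stores geometry and appearance in separate grids of this kind; the
# table size, primes, and time conditioning here are illustrative assumptions.

TABLE_SIZE = 2 ** 14
FEAT_DIM = 4
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

rng = np.random.default_rng(2)
table = rng.normal(0, 0.1, (TABLE_SIZE, FEAT_DIM)).astype(np.float32)

def hash_lookup(xyz, resolution=64):
    """Nearest-corner hash lookup for points xyz in [0, 1)^3 -> (N, FEAT_DIM)."""
    idx = np.floor(xyz * resolution).astype(np.uint64)  # (N, 3) voxel indices
    h = (idx[:, 0] * PRIMES[0]) ^ (idx[:, 1] * PRIMES[1]) ^ (idx[:, 2] * PRIMES[2])
    return table[(h % np.uint64(TABLE_SIZE)).astype(np.int64)]

def point_features(xyz, t):
    """Time-conditioned feature: hash-grid feature concatenated with time t."""
    feats = hash_lookup(xyz)
    t_col = np.full((len(xyz), 1), t, dtype=np.float32)
    return np.concatenate([feats, t_col], axis=1)  # fed to a small MLP in practice

pts = rng.random((5, 3)).astype(np.float32)  # toy dynamic point cloud
print(point_features(pts, t=0.25).shape)     # (5, FEAT_DIM + 1)
```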
- SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream [26.165424006344267]
Spike cameras offer distinct advantages over standard cameras.
Existing approaches reliant on spike cameras often assume optimal illumination.
We introduce SpikeNeRF, the first work that derives a NeRF-based volumetric scene representation from spike camera data.
arXiv Detail & Related papers (2024-03-17T13:51:25Z)
- Finding Visual Saliency in Continuous Spike Stream [23.591309376586835]
In this paper, we investigate visual saliency in continuous spike streams for the first time.
We propose a Recurrent Spiking Transformer framework based on a fully spiking neural network.
Our framework exhibits a substantial margin of improvement in highlighting and capturing visual saliency in the spike stream.
arXiv Detail & Related papers (2024-03-10T15:15:35Z)
- SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network [1.4732811715354455]
We introduce a novel approach for 3D human action recognition, denoted as SpATr (Spiral Auto-encoder and Transformer Network).
A lightweight auto-encoder, based on spiral convolutions, is employed to extract spatial geometrical features from each 3D mesh.
The proposed method is evaluated on three prominent 3D human action datasets: Babel, MoVi, and BMLrub.
arXiv Detail & Related papers (2023-06-30T11:49:00Z)
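Spiral convolution, the building block of SpATr's mesh encoder, orders each vertex's neighbors along a fixed spiral and applies one shared linear map to the concatenated features. A minimal SpiralNet-style sketch; spiral index computation and the downstream transformer are omitted, and all shapes are assumptions:

```python
import numpy as np

# Minimal spiral convolution sketch: for each mesh vertex, gather features of
# L vertices along a precomputed spiral ordering and apply one shared linear
# map. This illustrates the operator only, not SpATr's full network.

def spiral_conv(vertex_feats, spiral_idx, weight, bias):
    """
    vertex_feats: (V, C_in) per-vertex features
    spiral_idx:   (V, L) indices of each vertex's spiral neighborhood
    weight:       (L * C_in, C_out); bias: (C_out,)
    returns:      (V, C_out)
    """
    V, L = spiral_idx.shape
    gathered = vertex_feats[spiral_idx]           # (V, L, C_in)
    flat = gathered.reshape(V, -1)                # concatenate along the spiral
    return np.maximum(flat @ weight + bias, 0.0)  # shared linear map + ReLU

# Toy usage: random features and stand-in spiral orderings.
rng = np.random.default_rng(3)
V, L, C_in, C_out = 100, 9, 3, 16
feats = rng.normal(size=(V, C_in)).astype(np.float32)  # e.g. xyz coordinates
spirals = rng.integers(0, V, size=(V, L))              # stand-in spiral indices
W = rng.normal(0, 0.1, (L * C_in, C_out)).astype(np.float32)
b = np.zeros(C_out, dtype=np.float32)
print(spiral_conv(feats, spirals, W, b).shape)  # (100, 16)
```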
- StarNet: Style-Aware 3D Point Cloud Generation [82.30389817015877]
StarNet is able to reconstruct and generate high-fidelity 3D point clouds using a mapping network.
Our framework achieves performance comparable to the state of the art on various metrics in point cloud reconstruction and generation tasks.
arXiv Detail & Related papers (2023-03-28T08:21:44Z)
- Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer [23.15179173446486]
We introduce the first sparse Spiking Neural Network (SNN) framework for 3D human pose tracking based solely on events. Our approach eliminates the need to convert sparse data to dense formats or to incorporate additional images, thereby fully exploiting the innate sparsity of input events. Empirical experiments demonstrate the superiority of our approach over existing state-of-the-art (SOTA) ANN-based methods, requiring only 19.1% of the FLOPs and 3.6% of the energy cost.
arXiv Detail & Related papers (2023-03-16T22:56:12Z)
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strengths of both neural-network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
- InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
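The two-stage pattern the InfoFocus summary describes — a voxel-based region proposal network producing coarse boxes, followed by refinement — has roughly the shape below. All names, sizes, and the toy proposal and refinement logic are placeholders, not the paper's method:

```python
import numpy as np

# Skeleton of a generic two-stage 3D detection pipeline: stage 1 voxelizes
# the point cloud and proposes coarse boxes; stage 2 refines each proposal
# using the points around it. Everything here (grid size, box format, the
# mean-recentering "refinement") is a placeholder sketch.

VOXEL = 0.5  # voxel edge length in meters (assumed)

def voxelize(points, grid=(32, 32, 8)):
    """points: (N, 3) -> binary occupancy grid of shape `grid` (stage-1 input)."""
    occ = np.zeros(grid, dtype=np.float32)
    idx = np.floor(points / VOXEL).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid)), axis=1)
    occ[tuple(idx[keep].T)] = 1.0
    return occ

def propose_boxes(occ, k=3):
    """Toy RPN stand-in: return the k densest occupied columns as coarse boxes."""
    density = occ.sum(axis=2)  # bird's-eye-view point density
    flat = np.argsort(density.ravel())[::-1][:k]
    xs, ys = np.unravel_index(flat, density.shape)
    # box = (x, y, z, dx, dy, dz) in meters, coarse and axis-aligned
    return [(x * VOXEL, y * VOXEL, 0.0, VOXEL, VOXEL, 2.0) for x, y in zip(xs, ys)]

def refine(box, points):
    """Toy stage 2: re-center the box on the points it contains."""
    x, y, z, dx, dy, dz = box
    mask = (np.abs(points[:, 0] - x) < dx) & (np.abs(points[:, 1] - y) < dy)
    if mask.any():
        x, y = points[mask, 0].mean(), points[mask, 1].mean()
    return (x, y, z, dx, dy, dz)

# Toy usage on random LiDAR-like points.
rng = np.random.default_rng(4)
pts = rng.random((500, 3)) * np.array([16.0, 16.0, 4.0])
coarse = propose_boxes(voxelize(pts))
print([refine(b, pts) for b in coarse])
```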
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantee of the quality of the listed information and is not responsible for any consequences of its use.