SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception
- URL: http://arxiv.org/abs/2512.01908v1
- Date: Mon, 01 Dec 2025 17:26:40 GMT
- Title: SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception
- Authors: Gurmeher Khurana, Lan Wei, Dandan Zhang
- Abstract summary: Contact-rich robotic manipulation requires representations that encode local geometry. Modern visuo-tactile sensors capture both modalities in a single fused image. Most self-supervised learning frameworks compress feature maps into a global vector.
- Score: 6.975054201075641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
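The abstract's point that a global vector discards spatial structure can be illustrated with a small sketch. It is not SARL's SAL, PPDA, or RAM objective. It only assumes the standard BYOL regression loss (2 minus twice the cosine similarity) and a hypothetical per-location variant, `per_location_loss`, introduced here for illustration. The toy feature maps and all function names are assumptions.

```python
import math


def global_average_pool(fmap):
    """Collapse an H x W x C feature map (nested lists) into a C-dim vector."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def byol_style_loss(pred, target):
    """Normalized regression loss used by BYOL: 2 - 2 * cosine similarity."""
    p, z = l2_normalize(pred), l2_normalize(target)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def per_location_loss(fmap_a, fmap_b):
    """Hypothetical map-level loss: average the loss over matching cells."""
    cells = [
        byol_style_loss(fmap_a[i][j], fmap_b[i][j])
        for i in range(len(fmap_a))
        for j in range(len(fmap_a[0]))
    ]
    return sum(cells) / len(cells)


# Toy 2 x 2 maps with 2 channels: [1, 0] marks an "edge", [0, 1] background.
map_a = [[[1, 0], [0, 1]], [[0, 1], [0, 1]]]  # edge in the top-left cell
map_b = [[[0, 1], [0, 1]], [[0, 1], [1, 0]]]  # edge in the bottom-right cell

global_loss = byol_style_loss(global_average_pool(map_a), global_average_pool(map_b))
map_loss = per_location_loss(map_a, map_b)

print(f"global loss after pooling: {global_loss:.3f}")
print(f"per-location loss:         {map_loss:.3f}")
```

In this toy case both maps pool to the same vector, so the global loss is approximately zero even though the edge sits in different cells, while the per-location loss is approximately 1. A loss that acts on the feature map can therefore see a difference that the pooled objective cannot.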
Related papers
- SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning [30.87517633729756]
SSR is a framework designed for Structured Scene Reasoning. It seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. It achieves state-of-the-art performance on multiple spatial intelligence benchmarks.
arXiv Detail & Related papers (2026-02-28T02:05:35Z)
- SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation [63.48859753472547]
SpatialActor is a framework for robust robotic manipulation that explicitly decouples semantics and geometry. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions.
arXiv Detail & Related papers (2025-11-12T18:59:08Z)
- Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning [93.19037653970622]
We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
arXiv Detail & Related papers (2025-10-31T16:30:08Z)
- From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors [54.84863164684646]
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders. In this work, we introduce FALCON, a novel paradigm that injects rich 3D spatial tokens into the action head.
arXiv Detail & Related papers (2025-10-20T11:26:45Z)
- Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment [1.7188280334580195]
We make an attempt to optimize a Vision-Language Model (VLM) for hyperspectral scene understanding by exploiting a CLIP-style contrastive training framework. Our framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model. It is seen that the proposed method updates only 0.07 percent of the total parameters, yet yields state-of-the-art performance.
arXiv Detail & Related papers (2025-09-20T23:23:04Z)
- PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds [8.645078288584305]
We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware features. Our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation and 3D object detection.
arXiv Detail & Related papers (2025-03-18T05:17:06Z)
- SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning [17.29563451509921]
SaliencyI2PLoc is a contrastive learning architecture that fuses the saliency map into feature aggregation. Our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset.
arXiv Detail & Related papers (2024-12-20T05:20:10Z)
- Camera-based 3D Semantic Scene Completion with Sparse Guidance Network [18.415854443539786]
We propose a camera-based semantic scene completion framework called SGN.
SGN propagates semantics from semantic-aware seed voxels to the whole scene based on spatial geometry cues.
Our experimental results demonstrate the superiority of our SGN over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T04:17:27Z)
- Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version) [5.467140383171385]
Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state.
Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data.
arXiv Detail & Related papers (2023-12-01T13:56:28Z)
- LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment [63.83894701779067]
We propose LCPS, the first LiDAR-Camera Panoptic network.
In our approach, we conduct LiDAR-Camera fusion in three stages.
Our fusion strategy improves about 6.9% PQ performance over the LiDAR-only baseline on NuScenes dataset.
arXiv Detail & Related papers (2023-08-03T10:57:58Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Learning Where to Learn in Cross-View Self-Supervised Learning [54.14989750044489]
Self-supervised learning (SSL) has made enormous progress and largely narrowed the gap with supervised methods.
Current methods simply adopt uniform aggregation of pixels for embedding.
We present a new approach, Learning Where to Learn (LEWEL), to adaptively aggregate spatial information of features.
arXiv Detail & Related papers (2022-03-28T17:02:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.