OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
- URL: http://arxiv.org/abs/2511.01210v2
- Date: Thu, 06 Nov 2025 01:42:41 GMT
- Title: OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
- Authors: Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu
- Abstract summary: Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining.
We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception.
- Score: 23.18144879039764
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded, physically meaningful masks onto RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this representation, we present a multisensory vision-language-action architecture and train it from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception guides the robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming RGB-only and raw-sensor-input baselines by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
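The sensor-masked image is the paper's central construct, but this listing gives no implementation details. The following is a minimal sketch of the general idea only, assuming the sensor reading has already been calibrated and projected into the camera frame; the function name, threshold, and color coding are hypothetical, not taken from the paper.

```python
import numpy as np

def sensor_masked_image(rgb: np.ndarray,
                        sensor_map: np.ndarray,
                        threshold: float = 0.5,
                        alpha: float = 0.6) -> np.ndarray:
    """Overlay a spatially registered sensor reading onto an RGB frame.

    rgb        -- (H, W, 3) uint8 camera image
    sensor_map -- (H, W) float sensor intensity already projected into the
                  camera frame (e.g. IR temperature, radar return, sound
                  direction-of-arrival energy)
    """
    # Normalize the sensor reading to [0, 1].
    s = sensor_map.astype(np.float32)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)

    # Keep only the physically salient region as a binary mask.
    mask = s > threshold

    # Color-code the masked region (red channel here, hypothetically) and
    # alpha-blend it so the underlying RGB texture stays visible, keeping
    # the result close to ordinary RGB statistics.
    out = rgb.astype(np.float32)
    overlay = np.zeros_like(out)
    overlay[..., 0] = 255.0 * s
    out[mask] = (1.0 - alpha) * out[mask] + alpha * overlay[mask]
    return out.clip(0, 255).astype(np.uint8)
```

In the paper's pipeline, each such masked image would then pass through a lightweight per-sensor projector into the VLA backbone's input space; that stage is omitted from this sketch.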
Related papers
- DeFM: Learning Foundation Representations from Depth for Robotics [49.77188649197404]
We present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications.
DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors.
It achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.
arXiv Detail & Related papers (2026-01-26T19:45:31Z)
- Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization [0.8839687029212673]
Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction.
We present a framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video.
arXiv Detail & Related papers (2025-12-18T08:44:22Z)
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training.
We show that MiVLA achieves strongly improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z)
- UNIV: Unified Foundation Model for Infrared and Visible Modalities [12.0490466425884]
We propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV).
PCCL is an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition.
Our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing.
arXiv Detail & Related papers (2025-09-19T06:07:53Z)
- DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis [11.976362049118782]
This letter introduces DepthVision, a framework for multimodal scene understanding.
It synthesizes RGB images from sparse LiDAR point clouds using a conditional generative adversarial network (GAN).
These synthetic views are then combined with real RGB data using Luminance-Aware Modality Adaptation (LAMA).
arXiv Detail & Related papers (2025-09-09T07:42:07Z)
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning [14.038083767470019]
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities.
We show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.
arXiv Detail & Related papers (2025-05-23T09:06:09Z)
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [49.81255045696323]
We present the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet).
AuxDet integrates metadata semantics with visual features, guiding adaptive representation learning for each sample.
Experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human activity recognition has primarily relied on traditional RGB cameras to achieve high performance.
Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras.
In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z)
- HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models.
We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [85.63710017456792]
FuSe is a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities.
We show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound.
Experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
arXiv Detail & Related papers (2025-01-08T18:57:33Z)
- Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking [37.98711638929805]
We introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding.
We propose Sensor-Aware Attributes Fine-Tuning (SAFT) with Diverse Negative Attributes (DNA) optimization.
We present VS-TDX, the first comprehensive, public benchmark designed to rigorously evaluate VLMs' sensor-specific understanding.
arXiv Detail & Related papers (2024-12-30T06:44:25Z)
- Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction [2.240453048130742]
Vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters.
We propose a novel ultra-range gesture recognition (URGR) framework termed Graph Vision Transformer (GViT), which takes the enhanced image as input.
Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%.
arXiv Detail & Related papers (2023-11-26T17:27:26Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
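The listing does not reproduce the definition of $\mathcal{L}_{EC}$. Purely for orientation, event-contrastive objectives of this kind are commonly instantiated in an InfoNCE-style form such as the sketch below, where $z_i$ and $z_i^{+}$ are embeddings of two augmented views of the same event clip, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature; this notation is assumed here, not taken from the paper.

$$\mathcal{L}_{EC} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right)}{\sum_{j \neq i} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$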
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- A Universal Semantic-Geometric Representation for Robotic Manipulation [42.18087956844491]
We present $\textbf{Semantic-Geometric Representation}$ ($\textbf{SGR}$), a universal perception module for robotics.
SGR leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning.
Our experiments demonstrate that SGR empowers the agent to successfully complete a diverse range of simulated and real-world robotic manipulation tasks.
arXiv Detail & Related papers (2023-06-18T04:34:17Z)
- Bayesian Imitation Learning for End-to-End Mobile Manipulation [80.47771322489422]
Augmenting policies with additional sensor inputs, such as RGB + depth cameras, is a straightforward approach to improving robot perception capabilities.
We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains.
We demonstrate that our method is able to help close the sim-to-real gap and successfully fuse RGB and depth modalities.
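For reference, the Variational Information Bottleneck mentioned above is standardly written as a task loss plus a KL penalty on a stochastic encoding $z$ of the input $x$, with $\beta$ weighting the bottleneck; this is the generic formulation of the technique, not necessarily this paper's exact objective.

$$\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[-\log p_\theta(y \mid z)\right] + \beta\, \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$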
arXiv Detail & Related papers (2022-02-15T17:38:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.