SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
- URL: http://arxiv.org/abs/2512.00903v1
- Date: Sun, 30 Nov 2025 14:10:28 GMT
- Title: SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
- Authors: Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
- Abstract summary: We propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger.
- Score: 56.74139420555097
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, prior work has explored lightweight VLMs, but these compromise spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and using a 12 times smaller memory footprint.
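To make the training recipe concrete, here is a minimal PyTorch sketch of two of the ideas the abstract combines: learnable Fusion Tokens that attend over 2D and 4D features, and the mask-and-reconstruct objective that lets the 4D branch be dropped at inference. Everything in this sketch (module names, dimensions, the 0.5 mask ratio, the single-layer fusion encoder) is an assumption for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class FusionMaskReconstruct(nn.Module):
    """Illustrative sketch only: all names, dimensions, and the
    single-layer fusion design are assumptions, not SwiftVLA's code."""

    def __init__(self, dim: int = 512, num_fusion: int = 16, mask_ratio: float = 0.5):
        super().__init__()
        # Learnable Fusion Tokens that pool 2D and 4D evidence into a
        # unified representation for action generation.
        self.fusion_tokens = nn.Parameter(torch.randn(num_fusion, dim) * 0.02)
        # Placeholder embedding substituted for masked 4D tokens.
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.mask_ratio = mask_ratio

    def forward(self, feats_2d: torch.Tensor, feats_4d: torch.Tensor):
        # feats_2d: (B, N2, D) image tokens from the VLM's vision encoder.
        # feats_4d: (B, N4, D) tokens from the 4D visual geometry transformer.
        B, N4, D = feats_4d.shape
        n_fusion = self.fusion_tokens.shape[0]
        # Randomly hide a fraction of the 4D tokens during training.
        mask = torch.rand(B, N4, device=feats_4d.device) < self.mask_ratio
        masked_4d = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N4, D), feats_4d)
        # Fusion Tokens attend jointly to the 2D tokens and the
        # partially masked 4D tokens.
        seq = torch.cat([self.fusion_tokens.expand(B, n_fusion, D),
                         feats_2d, masked_4d], dim=1)
        out = self.encoder(seq)
        fused = out[:, :n_fusion]                      # fed to the action head
        recon = out[:, n_fusion + feats_2d.shape[1]:]  # outputs at the 4D slots
        # Reconstruction loss on masked positions only: the model must
        # recover 4D content from 2D context, which is what lets the 4D
        # branch be dropped at inference.
        recon_loss = ((recon - feats_4d) ** 2)[mask].mean()
        return fused, recon_loss
```

At inference, under this sketch's assumptions, the 4D tokens can simply be replaced by mask tokens and the 4D geometry transformer skipped, since training forces the fused representation to recover 4D content from 2D context alone.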
Related papers
- Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation [58.21084913574353]
We introduce Pri4R, a simple approach that endows VLA models with an implicit understanding of world dynamics. Pri4R augments VLA models with a lightweight point track head that predicts 3D point tracks. We show that Pri4R significantly improves performance on challenging manipulation tasks.
arXiv Detail & Related papers (2026-03-02T07:23:53Z) - MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence [50.11889361459544]
Humans are born with vision-based 4D spatial-temporal intelligence. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs).
arXiv Detail & Related papers (2026-02-28T07:23:36Z) - Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception [44.7850628565891]
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. We show that PointATA can match or even outperform strong fully fine-tuned models.
arXiv Detail & Related papers (2026-02-26T14:58:59Z) - Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds [57.024495128182195]
We conduct a pilot study across different observation spaces and visual representations. Results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. We propose Any3D-VLA to address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases.
arXiv Detail & Related papers (2026-01-31T16:34:52Z) - VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation [54.81449795163812]
We develop a general VLA model with 4D awareness for temporally coherent robotic manipulation. We extract visual features, embed 1D time into 3D positions to form 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. Within this framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent.
arXiv Detail & Related papers (2025-11-21T12:26:30Z) - Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation [61.60600246983274]
Existing 3D and 4D approaches typically embed scene geometry into autoregressive models for semantic understanding and diffusion models for content generation. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation.
arXiv Detail & Related papers (2025-09-28T12:06:54Z) - LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning [63.19329995235114]
A key bottleneck is that current scene representations struggle to balance performance and efficiency. We propose the condensed feature grid (CFG), an efficient scene representation featuring significantly reduced token overhead and strong perception capability. We introduce LEO-VL, a 3D VLM trained on 700k 3D vision-language samples spanning four real-world indoor domains and five tasks such as captioning and dialogue.
arXiv Detail & Related papers (2025-06-11T16:56:34Z) - PointVLA: Injecting the 3D World into Vision-Language-Action Models [10.758939578236582]
We propose PointVLA, a framework that enhances pre-trained vision-language-action models with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. PointVLA outperforms state-of-the-art 2D imitation learning methods across both simulated and real-world robotic tasks.
arXiv Detail & Related papers (2025-03-10T16:32:41Z) - GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models [39.488763757826426]
2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. We propose a vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding.
arXiv Detail & Related papers (2025-01-02T18:59:59Z) - VG4D: Vision-Language Model Goes 4D Video Recognition [34.98194339741201]
Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts.
We propose the Vision-Language Model Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
arXiv Detail & Related papers (2024-04-17T17:54:49Z) - V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representations with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)