Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
- URL: http://arxiv.org/abs/2508.09032v1
- Date: Tue, 12 Aug 2025 15:53:45 GMT
- Title: Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
- Authors: Maxim A. Patratskiy, Alexey K. Kovalev, Aleksandr I. Panov
- Abstract summary: We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. Experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and 19% compared to TraceVLA.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA.
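To make the idea concrete, here is a minimal sketch of the kind of visual prompt the abstract describes: a handful of key points are tracked across a short history of RGB observations, and the resulting traces are drawn onto the current (normalized) depth map, so that a single image carries both spatial (depth) and temporal (motion-trace) cues. Everything below is an illustrative assumption rather than the authors' implementation: the function name build_spatial_trace_prompt, the choice of Shi-Tomasi corners plus Lucas-Kanade optical flow for tracking, and all parameter values are hypothetical stand-ins (the paper's pipeline may, for instance, use a learned point tracker or model-estimated depth).

```python
import cv2
import numpy as np

def build_spatial_trace_prompt(frames, depth_map, num_points=16, trace_color=255):
    """Hypothetical sketch: overlay key-point traces from an RGB history onto a depth map.

    frames:    list of HxWx3 uint8 RGB observations, oldest first
    depth_map: HxW float32 depth image aligned with the latest frame
    Returns a single-channel uint8 image: normalized depth with traces drawn on top.
    """
    gray_prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    # Pick key points in the earliest frame (Shi-Tomasi corners).
    pts = cv2.goodFeaturesToTrack(gray_prev, maxCorners=num_points,
                                  qualityLevel=0.01, minDistance=10)
    if pts is None:  # no trackable points found
        pts = np.empty((0, 1, 2), dtype=np.float32)
    traces = [[p.ravel()] for p in pts]

    # Track the points through the rest of the history with Lucas-Kanade optical flow.
    for frame in frames[1:]:
        gray_next = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        if len(pts):
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(gray_prev, gray_next, pts, None)
            for trace, p, ok in zip(traces, nxt, status.ravel()):
                if ok:
                    trace.append(p.ravel())
            pts = nxt
        gray_prev = gray_next

    # Normalize depth to 0..255 and draw each trace as a polyline on top of it.
    prompt = cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    for trace in traces:
        pts_i = np.round(np.array(trace)).astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(prompt, [pts_i], isClosed=False, color=trace_color, thickness=2)
    return prompt
```

Under these assumptions, the returned image would be passed to the VLA backbone as (or alongside) the visual observation, serving as the combined spatial-temporal prompt.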
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - VILOD: A Visual Interactive Labeling Tool for Object Detection [0.0]
This thesis develops and investigates "VILOD: A Visual Interactive Labeling tool for Object Detection". It enables users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for Object Detection. The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories.
arXiv Detail & Related papers (2025-08-29T19:27:10Z) - ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver [35.25196177784228]
We propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention.
arXiv Detail & Related papers (2025-08-14T04:20:19Z) - TrackVLA: Embodied Visual Tracking in the Wild [34.03604806748204]
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. Existing approaches typically address this challenge through a modular separation of recognition and planning. We propose TrackVLA, a Vision-Language-Action model that learns the synergy between object recognition and trajectory planning.
arXiv Detail & Related papers (2025-05-29T07:28:09Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs [38.02017186215372]
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks.
However, existing V-LLMs demonstrate weak spatial reasoning and localization awareness.
We explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs.
arXiv Detail & Related papers (2024-04-11T03:09:34Z) - GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses the visual grounding ability of existing models trained on image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z) - Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - DeepVisualInsight: Time-Travelling Visualization for Spatio-Temporal Causality of Deep Classification Training [7.4940788786485095]
We propose a time-travelling visual solution, DeepVisualInsight, aiming to manifest the causality while training a deep learning image classifier.
We show how gradient-descent sampling techniques can influence and reshape the layout of the learnt input representations and the boundaries in consecutive epochs.
Our experiments show that, compared to baseline approaches, we achieve the best visualization performance regarding spatial/temporal properties and visualization efficiency.
arXiv Detail & Related papers (2021-12-31T07:05:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.