Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning
- URL: http://arxiv.org/abs/2509.23107v2
- Date: Mon, 27 Oct 2025 01:43:56 GMT
- Title: Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning
- Authors: Yi Wang, Zeyu Xue, Mujie Liu, Tongqin Zhang, Yan Hu, Zhou Zhao, Chenguang Yang, Zhenyu Lu
- Abstract summary: In dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remote perceived states and operator intent. We present a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. We show that our method achieves 74 percent node accuracy on the Replica benchmark, outperforming ConceptGraph.
- Score: 55.90805559207812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remote perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations, and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74 percent node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5 percent.
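The abstract does not give the form of the temporal matching cost or of the latency tag, so the following is only a minimal Python sketch under stated assumptions. The cost is assumed to be a weighted sum of 3D centroid distance and embedding dissimilarity, and the latency tag is modeled as a timestamp lookup into stored scene snapshots. The names `hungarian`, `matching_cost`, `state_at`, the weights, and all data are hypothetical and are not taken from the paper.

```python
import math
from bisect import bisect_right

def hungarian(cost):
    """Minimum-cost assignment (Hungarian algorithm, O(n^2 m)).
    Expects rows <= columns; returns {row_index: col_index}."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j] = row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                              # augment along the found path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j]}

def matching_cost(track, det, w_pos=1.0, w_sem=1.0):
    """Assumed cost: centroid distance plus (1 - cosine similarity)."""
    dist = math.dist(track["xyz"], det["xyz"])
    dot = sum(a * b for a, b in zip(track["emb"], det["emb"]))
    norm = math.hypot(*track["emb"]) * math.hypot(*det["emb"])
    return w_pos * dist + w_sem * (1.0 - dot / norm)

def state_at(history, t):
    """Latest snapshot with timestamp <= t, or None if none exists."""
    i = bisect_right([ts for ts, _ in history], t)
    return history[i - 1][1] if i else None

# Toy data: two tracked nodes, three detections in the next frame.
tracks = [{"label": "mug", "xyz": (0.0, 0.0, 0.0), "emb": (1.0, 0.0)},
          {"label": "book", "xyz": (1.0, 0.0, 0.0), "emb": (0.0, 1.0)}]
dets = [{"xyz": (1.1, 0.0, 0.0), "emb": (0.0, 1.0)},
        {"xyz": (0.2, 0.0, 0.0), "emb": (1.0, 0.0)},
        {"xyz": (3.0, 3.0, 0.0), "emb": (1.0, 1.0)}]
cost = [[matching_cost(t, d) for d in dets] for t in tracks]
assignment = hungarian(cost)                   # track index -> detection index
unmatched = sorted(set(range(len(dets))) - set(assignment.values()))

# Latency lookup: recover the scene as it was `latency` seconds ago.
history = [(0.0, {"mug": (0.0, 0.0, 0.0)}),
           (1.0, {"mug": (0.2, 0.0, 0.0)}),
           (2.0, {"mug": (0.5, 0.0, 0.0)})]
now, latency = 2.3, 0.8
past = state_at(history, now - latency)
```

In this toy run the detections arrive in a different order than the tracks, and the assignment still pairs each track with its nearby, semantically similar detection. The third detection stays unmatched and would become a new node. With a clock time of 2.3 s and a latency of 0.8 s, `state_at` returns the snapshot recorded at t = 1.0 s.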
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query.
We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario.
Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), are constructed leveraging modern Multimodal Large Language Models (MLLMs).
Experiments on popular benchmarks demonstrate the superiority of the proposed approach, where it outperforms existing weakly-supervised and zero-shot approaches by a margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z)
- TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control [15.534182843429043]
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency.
We propose TIDAL, a hierarchical framework that decouples semantic reasoning from high-frequency actuation.
TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture.
arXiv Detail & Related papers (2026-01-21T12:43:11Z)
- ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning [52.86018040861575]
We propose a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network.
We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens.
Experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines.
arXiv Detail & Related papers (2025-12-11T18:59:46Z)
- 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning [53.28271278708241]
We present a Detector-Empowered Video LLM, DEViL for short.
DEViL couples a Video LLM with an open-vocabulary detector (OVD).
Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding.
arXiv Detail & Related papers (2025-12-07T06:11:15Z)
- UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks [0.0]
Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions.
Existing supervised and weakly supervised solutions often rely on extensive datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios.
Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising annotated with blockwise partitions.
Our method achieves a mean Average Precision (mAP) of 82.66% and an average localization latency of 29.09 ms on the DSV Diving dataset.
arXiv Detail & Related papers (2025-08-27T07:51:02Z)
- Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection [13.682115079677466]
RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions.
We propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for the alignment-free RGBT VOD problem.
arXiv Detail & Related papers (2025-04-16T05:32:59Z)
- SCoTT: Strategic Chain-of-Thought Tasking for Wireless-Aware Robot Navigation in Digital Twins [78.53885607559958]
We propose SCoTT, a wireless-aware path planning framework.
We show that SCoTT achieves path gains within 2% of DP-WA* while consistently generating shorter trajectories.
We also show the practical viability of our approach by deploying SCoTT as a ROS node within Gazebo simulations.
arXiv Detail & Related papers (2024-11-27T10:45:49Z)
- STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting [11.208740750755025]
Traffic forecasting is a cornerstone of smart city management, enabling efficient resource allocation and transportation planning.
Deep learning, with its ability to capture complex nonlinear patterns in data, has emerged as a powerful tool for traffic forecasting.
Graph convolutional networks (GCNs) and transformer-based models have shown promise, but their computational demands often hinder their application to real-world networks.
We propose a novel spatiotemporal graph transformer (STGformer) architecture, enabling efficient modeling of both global and local traffic patterns while maintaining a manageable computational footprint.
arXiv Detail & Related papers (2024-10-01T04:15:48Z)
- Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout [4.421486904657393]
We introduce a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment.
We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent's task objectives.
We introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling.
arXiv Detail & Related papers (2024-03-23T12:53:51Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely efficient parallel split learning (EPSL), to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- StrObe: Streaming Object Detection from LiDAR Packets [73.27333924964306]
Rolling shutter LiDAR data is emitted as a stream of packets, each covering a sector of the 360° coverage.
Modern perception algorithms wait for the full sweep to be built before processing the data, which introduces an additional latency.
In this paper we propose StrObe, a novel approach that minimizes latency by ingesting LiDAR packets and emitting a stream of detections without waiting for the full sweep to be built.
arXiv Detail & Related papers (2020-11-12T14:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.