ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning
- URL: http://arxiv.org/abs/2512.07371v2
- Date: Mon, 15 Dec 2025 00:51:44 GMT
- Title: ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning
- Authors: Byungju Kim, Jinu Pahk, Chungwoo Lee, Jaejoon Kim, Jangha Lee, Theo Taeyeong Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang
- Abstract summary: ESPADA is a semantically aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping. ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
- Score: 18.435889278351297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Behavior-cloning-based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. Prior acceleration methods mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantically and spatially aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data, architectural modifications, or any form of retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
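The label-propagation and downsampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`dtw_path`, `propagate_labels`, `downsample_indices`), the boolean "critical" labels, the Euclidean frame distance, and the fixed `stride` are all assumptions made for illustration, and the feature arrays stand in for whatever dynamics-only features the paper actually uses.

```python
import numpy as np

def dtw_path(ref, tgt):
    """Classic O(n*m) DTW; returns aligned (ref_idx, tgt_idx) pairs."""
    n, m = len(ref), len(tgt)
    # Pairwise Euclidean distances between per-frame feature vectors.
    dist = np.linalg.norm(ref[:, None, :] - tgt[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    # Backtrack from the end to recover the warping path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def propagate_labels(ref_feats, ref_critical, tgt_feats):
    """Copy per-frame 'critical' labels from the annotated episode to another one."""
    tgt_critical = np.zeros(len(tgt_feats), dtype=bool)
    for i, j in dtw_path(ref_feats, tgt_feats):
        # Assumption: a target frame is critical if any aligned reference frame is.
        tgt_critical[j] |= bool(ref_critical[i])
    return tgt_critical

def downsample_indices(critical, stride=2):
    """Keep every critical frame; keep every `stride`-th frame elsewhere."""
    return np.array([t for t, c in enumerate(critical) if c or t % stride == 0])
```

A usage pattern under these assumptions: annotate one episode with per-frame critical labels, call `propagate_labels` once per remaining episode, then index each episode's observations and actions with `downsample_indices(...)`. The quadratic DTW loop is acceptable for a sketch but would likely need a banded or library-backed variant for long episodes.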
Related papers
- From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction [22.291273919939957]
We develop a scalable synthetic data pipeline that generates human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. We train a unified ViT-based dense predictor that injects an explicit geometric human prior via CSE embeddings. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences.
arXiv Detail & Related papers (2026-02-02T05:28:58Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, offers significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation [65.6201974979119]
We propose SemanticVLA, a novel VLA framework that performs Semantic-Hierarchical Sparsification and Enhancement for Efficient Robotic Manipulation. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively.
arXiv Detail & Related papers (2025-11-13T17:24:37Z)
- Obstacle Avoidance using Dynamic Movement Primitives and Reinforcement Learning [36.09105994195904]
This work proposes an alternative approach that quickly generates smooth, near-optimal collision-free 3D Cartesian trajectories from a single artificial demonstration. The demonstration is encoded as a Dynamic Movement Primitive (DMP) and iteratively reshaped using policy-based reinforcement learning. The approach is validated in simulation and real-robot experiments, outperforming an RRT-Connect baseline in terms of computation and execution time.
arXiv Detail & Related papers (2025-10-10T10:51:42Z)
- Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss [52.28880405119483]
Unsupervised online 3D instance segmentation is a fundamental yet challenging task. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation.
arXiv Detail & Related papers (2025-09-27T08:53:27Z)
- Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for dexterous robotic manipulation. We introduce Dexplore, a unified single-loop optimization that performs retargeting and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z)
- Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking [16.366398265001422]
3D multi-object tracking is a critical and challenging task in the field of autonomous driving. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle.
arXiv Detail & Related papers (2025-08-15T08:48:13Z)
- Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation [10.122882293302787]
Temporal segmentation of human actions is critical for intelligent robots in collaborative settings. We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data. Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
arXiv Detail & Related papers (2025-07-01T13:55:57Z)
- PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting [90.47748423913369]
State-of-the-art motion forecasting models rely on large curated datasets with manually annotated or heavily post-processed trajectories. PPT is a simple and scalable alternative that uses unprocessed and diverse trajectories automatically generated from off-the-shelf 3D detectors and tracking. It achieves strong performance across standard benchmarks, particularly in low-data regimes and in cross-domain, end-to-end, and multi-class settings.
arXiv Detail & Related papers (2024-12-09T13:48:15Z)
- MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking [58.719310295870024]
This paper presents an event-based framework for tracking any point. To resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. The method improves the $Survival_{50}$ metric by 17.9% over the event-only tracking-any-point baseline.
arXiv Detail & Related papers (2024-12-02T09:13:29Z)
- Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF). It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.