V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
- URL: http://arxiv.org/abs/2601.15164v1
- Date: Wed, 21 Jan 2026 16:41:51 GMT
- Title: V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
- Authors: Yaru Liu, Ao-bo Wang, Nanyang Ye
- Abstract summary: V-CAGE is a closed-loop framework for generating semantically aligned manipulation datasets at scale. We propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. We also employ a hierarchical instruction decomposition module to bridge the gap between abstract intent and low-level control.
- Score: 6.820118518027692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
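The abstract describes two mechanisms concretely enough to illustrate: scene instantiation that maintains a map of prohibited spatial areas as objects are placed, and VLM-based rejection sampling after each subtask. The sketch below is a minimal Python illustration under assumptions of our own, not the authors' implementation: object footprints are reduced to 2-D axis-aligned boxes, and `execute_subtask` and `vlm_accepts` are hypothetical stand-ins for the low-level controller and the VLM visual critic.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) footprint


def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned overlap test used to detect interpenetration."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


@dataclass
class Scene:
    workspace: Box
    prohibited: List[Box]  # spatial areas already claimed by placed objects

    def try_place(self, size: Tuple[float, float], max_attempts: int = 100) -> Optional[Box]:
        """Sample a pose that avoids every prohibited region (context-aware instantiation)."""
        w, h = size
        x0, y0, x1, y1 = self.workspace
        for _ in range(max_attempts):
            x = random.uniform(x0, x1 - w)
            y = random.uniform(y0, y1 - h)
            candidate = (x, y, x + w, y + h)
            if all(not overlaps(candidate, region) for region in self.prohibited):
                self.prohibited.append(candidate)  # update the map as objects are placed
                return candidate
        return None  # no conflict-free pose found; the caller may re-sample the scene


def run_subtask_with_verification(
    subtask: str,
    execute_subtask: Callable[[str], object],    # hypothetical low-level executor
    vlm_accepts: Callable[[str, object], bool],  # hypothetical VLM visual critic
    max_retries: int = 3,
) -> bool:
    """Rejection sampling: keep a rollout only if the critic confirms the visual goal."""
    for _ in range(max_retries):
        observation = execute_subtask(subtask)  # code may "succeed" yet miss the semantics
        if vlm_accepts(subtask, observation):   # filter out such silent failures
            return True
    return False
```

In this reading, the decomposition module would call `run_subtask_with_verification` once per action primitive in the decomposed plan, discarding any rollout the critic rejects before it enters the dataset.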
Related papers
- Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z)
- Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining [56.62125584296097]
Keyframe-Chaining VLA is a framework that extracts and links key historical frames to model long-horizon dependencies. We design a progress-aware mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. We introduce a suite of four non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates.
arXiv Detail & Related papers (2026-03-02T05:26:29Z)
- Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective [16.541717037293278]
We introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability.
arXiv Detail & Related papers (2025-11-14T16:56:01Z)
- Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert [60.88976842557026]
Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities. Recent dual-system approaches attempt to decouple "thinking" from "acting". We introduce a framework centered around a generalizable action expert.
arXiv Detail & Related papers (2025-10-04T18:33:27Z)
- From Code to Action: Hierarchical Learning of Diffusion-VLM Policies [8.0703783175731]
Imitation learning for robotic manipulation often suffers from limited generalization and data scarcity. In this work, we introduce a hierarchical framework that leverages code-generating vision-language models (VLMs). We find that this design enables interpretable policy decomposition, improves generalization compared to flat policies, and enables separate evaluation of high-level planning and low-level control.
arXiv Detail & Related papers (2025-09-29T15:22:18Z)
- SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks [3.688836621357062]
Long-horizon manipulation tasks involve extended action sequences and complex object interactions. We propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image.
arXiv Detail & Related papers (2025-09-26T06:14:55Z)
- Embodied Long Horizon Manipulation with Closed-loop Code Generation and Incremental Few-shot Adaptation [12.077740860502878]
Embodied long-horizon manipulation requires robotic systems to process multimodal inputs, such as vision and natural language, and translate them into executable actions. Recent methods have explored using large language models (LLMs) as high-level planners that decompose tasks into subtasks using natural language and guide pretrained low-level controllers. Our framework achieves state-of-the-art performance on 30+ diverse seen and unseen long-horizon tasks across LoHoRavens, CALVIN, Franka Kitchen, and cluttered real-world settings.
arXiv Detail & Related papers (2025-03-27T20:32:58Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z)
- Universal Visual Decomposer: Long-Horizon Manipulation Made Easy [54.93745986073738]
Real-world robotic tasks stretch over extended horizons and encompass multiple stages.
Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks.
We propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation.
We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings.
arXiv Detail & Related papers (2023-10-12T17:59:41Z)
- Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning [60.501201259732625]
We introduce task-adaptive saliency for exemplar-free class-incremental learning (EFCIL) and propose a new framework, which we call Task-Adaptive Saliency Supervision (TASS).
Our experiments demonstrate that our method can better preserve saliency maps across tasks and achieve state-of-the-art results on the CIFAR-100, Tiny-ImageNet, and ImageNet-Subset EFCIL benchmarks.
arXiv Detail & Related papers (2022-12-16T02:43:52Z)