V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
- URL: http://arxiv.org/abs/2601.15164v1
- Date: Wed, 21 Jan 2026 16:41:51 GMT
- Title: V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks
- Authors: Yaru Liu, Ao-bo Wang, Nanyang Ye
- Abstract summary: V-CAGE is a closed-loop framework for generating semantically aligned manipulation datasets at scale. We propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. We also employ a hierarchical instruction decomposition module to bridge the gap between abstract intent and low-level control.
- Score: 6.820118518027692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
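The abstract describes two mechanisms concretely enough to illustrate: scene instantiation that maintains a map of prohibited spatial areas as objects are placed, and VLM-based rejection sampling after each subtask. The sketch below is a minimal Python illustration under assumptions of our own, not the authors' implementation: object footprints are reduced to 2-D axis-aligned boxes, and `execute_subtask` and `vlm_accepts` are hypothetical stand-ins for the low-level controller and the VLM visual critic.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) footprint


def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned overlap test used to detect interpenetration."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


@dataclass
class Scene:
    workspace: Box
    prohibited: List[Box]  # spatial areas already claimed by placed objects

    def try_place(self, size: Tuple[float, float], max_attempts: int = 100) -> Optional[Box]:
        """Sample a pose that avoids every prohibited region (context-aware instantiation)."""
        w, h = size
        x0, y0, x1, y1 = self.workspace
        for _ in range(max_attempts):
            x = random.uniform(x0, x1 - w)
            y = random.uniform(y0, y1 - h)
            candidate = (x, y, x + w, y + h)
            if all(not overlaps(candidate, region) for region in self.prohibited):
                self.prohibited.append(candidate)  # update the map as objects are placed
                return candidate
        return None  # no conflict-free pose found; the caller may re-sample the scene


def run_subtask_with_verification(
    subtask: str,
    execute_subtask: Callable[[str], object],    # hypothetical low-level executor
    vlm_accepts: Callable[[str, object], bool],  # hypothetical VLM visual critic
    max_retries: int = 3,
) -> bool:
    """Rejection sampling: keep a rollout only if the critic confirms the visual goal."""
    for _ in range(max_retries):
        observation = execute_subtask(subtask)  # code may "succeed" yet miss the semantics
        if vlm_accepts(subtask, observation):   # filter out such silent failures
            return True
    return False
```

In this reading, the decomposition module would call `run_subtask_with_verification` once per action primitive in the decomposed plan, discarding any rollout the critic rejects before it enters the dataset.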
Related papers
- Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z)
- Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining [56.62125584296097]
Keyframe-Chaining VLA is a framework that extracts and links key historical frames to model long-horizon dependencies. We design a progress-aware mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. We introduce a suite of four non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates.
arXiv Detail & Related papers (2026-03-02T05:26:29Z)
- Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective [16.541717037293278]
We introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability.
arXiv Detail & Related papers (2025-11-14T16:56:01Z)
- Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert [60.88976842557026]
Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities. Recent dual-system approaches attempt to decouple "thinking" from "acting". We introduce a framework centered around a generalizable action expert.
arXiv Detail & Related papers (2025-10-04T18:33:27Z)
- From Code to Action: Hierarchical Learning of Diffusion-VLM Policies [8.0703783175731]
Imitation learning for robotic manipulation often suffers from limited generalization and data scarcity. In this work, we introduce a hierarchical framework that leverages code-generating vision-language models (VLMs). We find that this design enables interpretable policy decomposition, improves generalization compared to flat policies, and enables separate evaluation of high-level planning and low-level control.
arXiv Detail & Related papers (2025-09-29T15:22:18Z)
- SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks [3.688836621357062]
Long-horizon manipulation tasks involve extended action sequences and complex object interactions. We propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image.
arXiv Detail & Related papers (2025-09-26T06:14:55Z)
- Embodied Long Horizon Manipulation with Closed-loop Code Generation and Incremental Few-shot Adaptation [12.077740860502878]
Embodied long-horizon manipulation requires robotic systems to process multimodal inputs, such as vision and natural language, and translate them into executable actions. Recent methods have explored using large language models (LLMs) as high-level planners that decompose tasks into subtasks using natural language and guide pretrained low-level controllers. Our framework achieves state-of-the-art performance on 30+ diverse seen and unseen long-horizon tasks across LoHoRavens, CALVIN, Franka Kitchen, and cluttered real-world settings.
arXiv Detail & Related papers (2025-03-27T20:32:58Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z)
- Universal Visual Decomposer: Long-Horizon Manipulation Made Easy [54.93745986073738]
Real-world robotic tasks stretch over extended horizons and encompass multiple stages.
Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks.
We propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation.
We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings.
arXiv Detail & Related papers (2023-10-12T17:59:41Z)
- Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning [60.501201259732625]
We introduce task-adaptive saliency for exemplar-free class-incremental learning (EFCIL) and propose a new framework, which we call Task-Adaptive Saliency Supervision (TASS).
Our experiments demonstrate that our method can better preserve saliency maps across tasks and achieve state-of-the-art results on the CIFAR-100, Tiny-ImageNet, and ImageNet-Subset EFCIL benchmarks.
arXiv Detail & Related papers (2022-12-16T02:43:52Z)