Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
- URL: http://arxiv.org/abs/2505.15660v1
- Date: Wed, 21 May 2025 15:35:57 GMT
- Title: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
- Authors: Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifang Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang
- Abstract summary: AGNOSTOS is a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. X-ICM is a method that conditions large language models on in-context demonstrations to predict action sequences for unseen tasks. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
- Score: 12.052338864734917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
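The following is a minimal sketch of the cross-task in-context manipulation idea described in the abstract, under stated assumptions: demonstrations from seen tasks are represented by precomputed "dynamics" feature vectors (e.g., embeddings of how the scene changes during a demonstration), and action sequences are serialized as text for the LLM. All names here (Demo, select_demos, build_prompt, llm_generate) are hypothetical illustrations, not the authors' actual API.

```python
# Hypothetical sketch of dynamics-guided demonstration selection + LLM prompting.
from dataclasses import dataclass
from typing import Callable, List, Sequence
import numpy as np


@dataclass
class Demo:
    task_description: str          # language instruction of a seen task
    action_text: str               # serialized action sequence (e.g., key poses)
    dynamics_feature: np.ndarray   # feature capturing how the scene changes


def select_demos(query_feature: np.ndarray, pool: Sequence[Demo], k: int = 4) -> List[Demo]:
    """Dynamics-guided selection: rank seen-task demos by cosine similarity
    between their dynamics features and the unseen task's feature."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(pool, key=lambda d: cosine(query_feature, d.dynamics_feature), reverse=True)
    return ranked[:k]


def build_prompt(unseen_instruction: str, demos: Sequence[Demo]) -> str:
    """Format the selected demonstrations as in-context examples for an LLM."""
    blocks = [f"Task: {d.task_description}\nActions: {d.action_text}" for d in demos]
    blocks.append(f"Task: {unseen_instruction}\nActions:")
    return "\n\n".join(blocks)


def predict_actions(unseen_instruction: str,
                    query_feature: np.ndarray,
                    pool: Sequence[Demo],
                    llm_generate: Callable[[str], str],
                    k: int = 4) -> str:
    """Condition an LLM (supplied as a callable) on cross-task demonstrations
    to predict an action sequence for an unseen task."""
    prompt = build_prompt(unseen_instruction, select_demos(query_feature, pool, k))
    return llm_generate(prompt)
```

In this reading, the dynamics features would come from a visual encoder applied to demonstration videos; the key design choice is that retrieval is driven by cross-task dynamics similarity rather than instruction similarity alone.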
Related papers
- Enhancing Cross-task Transfer of Large Language Models via Activation Steering [75.41750053623298]
Cross-task in-context learning offers a direct solution for transferring knowledge across tasks. We investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. We propose a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states.
arXiv Detail & Related papers (2025-07-17T15:47:22Z) - One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models [19.705340191553496]
Unified vision-language models (VLMs) can address diverse tasks through different instructions within a shared computational architecture. Adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset for evaluating cross-task adversarial attacks on unified VLMs.
arXiv Detail & Related papers (2025-07-10T12:40:34Z) - Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation [14.977743061489518]
We introduce Object-Focus Actor (OFA), a novel, data-efficient approach for generalized dexterous manipulation. OFA exploits the consistent end trajectories observed in dexterous manipulation tasks, allowing for efficient policy training. OFA achieves robust performance with only 10 demonstrations, highlighting its data efficiency.
arXiv Detail & Related papers (2025-05-21T04:37:56Z) - DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [7.626715427413578]
Vision-language-action (VLA) models have shown promise for generalizable robot skills. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks.
arXiv Detail & Related papers (2025-02-09T11:25:56Z) - VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks [100.3234156027118]
We present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. The benchmark assesses multiple competencies, including understanding of mesh & texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning.
arXiv Detail & Related papers (2024-12-24T06:03:42Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, in both zero-shot and few-shot settings, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents [105.13169239919272]
We propose RH20T-P, a primitive-level robotic manipulation dataset. It contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. We standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P.
arXiv Detail & Related papers (2024-03-28T17:42:54Z) - CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models [33.78307982736911]
Cross-task generalization is of strong research and application value.
We propose a large-scale benchmark that includes 216 existing code-related tasks.
arXiv Detail & Related papers (2023-02-08T13:04:52Z) - Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme that avoids common pitfalls of goal conditioning by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)