LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
- URL: http://arxiv.org/abs/2511.02239v1
- Date: Tue, 04 Nov 2025 04:02:51 GMT
- Title: LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
- Authors: Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
- Abstract summary: We introduce LACY (Language-Action Cycle), a unified framework that learns bidirectional mappings within a single vision-language model. We show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation.
- Score: 11.419077130835829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/
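The abstract describes a three-task cycle (L2A, A2L, L2C) that generates and filters its own training data. Below is a minimal Python sketch of one pass through that loop, assuming a generic text-in/text-out VLM interface and an illustrative 0.8 confidence threshold; the function names and the threshold are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    scene: str        # placeholder for the visual observation
    instruction: str  # language command
    action: str       # parameterized action, e.g. "pick(x1,y1); place(x2,y2)"

def self_improve_step(
    query_vlm: Callable[[str], str],    # one shared VLM answers all three tasks
    score_vlm: Callable[[str], float],  # same VLM queried for a score in [0, 1]
    scene: str,
    instruction: str,
    threshold: float = 0.8,             # assumed cutoff for a "confident" cycle
) -> Optional[Sample]:
    # L2A: generate a parameterized action from the instruction.
    action = query_vlm(f"Scene: {scene}\nGive an action for: {instruction}")
    # A2L: explain the generated action back in language.
    explanation = query_vlm(f"Scene: {scene}\nDescribe this action: {action}")
    # L2C: check that the explanation matches the original instruction.
    confidence = score_vlm(f"Same meaning? A: {instruction} B: {explanation}")
    # Confident cycles become new self-supervised training data; low-confidence
    # cases are the targets of the active augmentation strategy instead.
    if confidence >= threshold:
        return Sample(scene=scene, instruction=instruction, action=action)
    return None
```

Per the abstract, a single VLM plays all three roles, which is what lets the model improve itself without additional human labels.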
Related papers
- LUMOS: Language-Conditioned Imitation Learning with World Models [31.827127896338336]
We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model. We are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model.
arXiv Detail & Related papers (2025-03-13T13:48:24Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
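A hedged sketch of the relabeling idea in the DIAL entry above: score candidate instructions against an unlabeled demonstration with a CLIP-style image-text similarity and keep the best match. The `clip_sim` callable, the use of the final frame, and the 0.25 cutoff are illustrative assumptions, not the paper's code.

```python
from typing import Callable, Optional, Sequence

def relabel_demo(
    final_frame: object,                       # e.g., last image of an unlabeled demo
    candidates: Sequence[str],                 # candidate instructions to propagate
    clip_sim: Callable[[object, str], float],  # stand-in for CLIP image-text similarity
    min_sim: float = 0.25,                     # assumed acceptance threshold
) -> Optional[str]:
    """Return the best-matching instruction, or None if nothing matches well."""
    best = max(candidates, key=lambda text: clip_sim(final_frame, text))
    return best if clip_sim(final_frame, best) >= min_sim else None
```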
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires language annotations for as little as 1% of the total data.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- Learning Flexible Translation between Robot Actions and Language Descriptions [16.538887534958555]
We propose paired gated autoencoders (PGAE) for flexible translation between robot actions and language descriptions.
We train our model in an end-to-end fashion by pairing each action with appropriate descriptions that contain a signal informing about the translation direction.
With the option to use a pretrained language model as the language encoder, our model has the potential to recognise unseen natural language input.
arXiv Detail & Related papers (2022-07-15T12:37:05Z)
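A small sketch of the direction signal mentioned in the PGAE entry above: each training pair carries a prefix token telling the shared model which way to translate. The token names and sequence format are assumptions for illustration.

```python
from typing import List, Tuple

def make_training_pair(
    action_tokens: List[str],  # e.g., ["push", "cube", "left"]
    description: str,          # e.g., "the robot pushes the cube to the left"
    direction: str,            # "describe" (action->language) or "execute" (language->action)
) -> Tuple[List[str], List[str]]:
    """Build an (input, target) pair with an explicit translation-direction signal."""
    if direction == "describe":
        return [direction] + action_tokens, description.split()
    if direction == "execute":
        return [direction] + description.split(), action_tokens
    raise ValueError(f"unknown direction: {direction}")
```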
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
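In SayCan, the combination described above amounts to scoring each low-level skill by the product of the language model's usefulness estimate ("say") and a learned affordance or value function ("can"). A toy sketch with stand-in scorers:

```python
from typing import Callable, Sequence

def select_skill(
    instruction: str,
    skills: Sequence[str],
    lm_score: Callable[[str, str], float],       # "say": how useful is this skill?
    affordance: Callable[[object, str], float],  # "can": how likely to succeed here?
    state: object,
) -> str:
    """Pick the skill maximizing lm_score(instruction, skill) * affordance(state, skill)."""
    return max(skills, key=lambda s: lm_score(instruction, s) * affordance(state, s))

# Hypothetical usage with dummy scorers:
skills = ["pick up the sponge", "open the drawer", "go to the counter"]
lm = lambda instr, s: 0.9 if s.split(" ")[0] in instr else 0.1
can = lambda state, s: 0.8  # e.g., from a trained value function
print(select_skill("please pick up the sponge", skills, lm, can, state=None))
```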
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
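The semantic translation step in the entry above maps each free-form step produced by the language model to the nearest admissible action, via similarity in a sentence-embedding space. A self-contained sketch in which `embed` stands in for a sentence encoder:

```python
import math
from typing import Callable, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translate_step(
    free_form_step: str,
    admissible_actions: List[str],
    embed: Callable[[str], Sequence[float]],  # stand-in for a sentence encoder
) -> str:
    """Map an LM plan step to the closest admissible action."""
    v = embed(free_form_step)
    return max(admissible_actions, key=lambda a: cosine(v, embed(a)))
```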
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.