Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation
- URL: http://arxiv.org/abs/2603.04466v1
- Date: Tue, 03 Mar 2026 22:15:55 GMT
- Title: Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation
- Authors: Vaishak Kumar,
- Abstract summary: We present Act-Observe-Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation policy. AOR makes the full low-level motor control implementation the unit of LLM reasoning. We report promising results, with the agent achieving high success rates without demonstrations, reward engineering, or gradient updates.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can a multimodal language model learn to manipulate physical objects by reasoning about its own failures, without gradient updates, demonstrations, or reward engineering? We argue the answer is yes, under conditions we characterise precisely. We present Act-Observe-Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation policy by synthesising entirely new executable Python controller code between trials, guided by visual observations and structured episode outcomes. Unlike prior work that grounds LLMs in pre-defined skill libraries or uses code generation for one-shot plan synthesis, AOR makes the full low-level motor control implementation the unit of LLM reasoning, enabling the agent to change not just what the robot does, but how it does it. The central claim is that using interpretable code as the policy representation creates a qualitatively different kind of in-context learning from opaque neural policies: the agent can diagnose systematic failures and rewrite their causes. We validate this across three robosuite manipulation tasks and report promising results, with the agent achieving high success rates without demonstrations, reward engineering, or gradient updates.
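The abstract describes a trial-level loop: act with the current controller, observe the structured episode outcome (plus visual observations), and have the LLM rewrite the full controller source before the next trial. The paper's code is not reproduced here, so the sketch below is only one way such a loop could look. It assumes robosuite's Lift task, a hypothetical `query_llm` helper standing in for whatever multimodal LLM API is actually used, and generated controllers that expose a `policy(obs) -> action` function; none of these names come from the paper.

```python
# Minimal Act-Observe-Rewrite-style loop (illustrative sketch, not the authors' code).
# Assumed pieces: robosuite's Lift task, a stubbed `query_llm` LLM client,
# and LLM-generated controller source that defines `policy(obs) -> action`.
import numpy as np
import robosuite as suite


def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a multimodal LLM and return the
    full Python source of a new controller. Plug in a real client here."""
    raise NotImplementedError


def compile_policy(source: str):
    """Rewrite step: exec the generated controller source and return its
    `policy` function (only run LLM-written code in a sandbox you trust)."""
    namespace = {"np": np}
    exec(source, namespace)
    return namespace["policy"]


def run_episode(env, policy, horizon=200):
    """Act + Observe: roll out the controller and record a structured outcome."""
    obs = env.reset()
    total_reward, success, steps = 0.0, False, 0
    for steps in range(1, horizon + 1):
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        success = success or env._check_success()  # robosuite's internal success check
        if done:
            break
    return {"success": bool(success), "return": float(total_reward), "steps": steps}


env = suite.make("Lift", robots="Panda", has_renderer=False, use_camera_obs=False)

controller_src = query_llm("Write a Python `policy(obs) -> action` controller for the Lift task.")
for trial in range(10):
    outcome = run_episode(env, compile_policy(controller_src))      # Act / Observe
    if outcome["success"]:
        break
    controller_src = query_llm(                                     # Rewrite
        f"The previous controller failed with outcome {outcome}.\n"
        f"Previous source:\n{controller_src}\n"
        "Diagnose the failure and rewrite the complete controller."
    )
```

The point this sketch tries to capture is the abstract's central claim: the unit of rewriting is the entire low-level controller implementation, not a plan over fixed skills, so each iteration can change how the motion is produced rather than only which steps run.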
Related papers
- MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation [0.0]
MALLVI presents a framework that enables closed-loop, feedback-driven robotic manipulation. Rather than using a single model, MALLVI coordinates specialized agents to manage perception, localization, reasoning, and high-level planning.
arXiv Detail & Related papers (2026-02-18T21:28:56Z)
- Demonstration-Free Robotic Control via LLM Agents [0.0]
We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning.
arXiv Detail & Related papers (2026-01-28T07:49:35Z)
- ALRM: Agentic LLM for Robotic Manipulation [3.7473235317736058]
Large Language Models (LLMs) have recently empowered agentic frameworks to exhibit advanced reasoning and planning capabilities.
arXiv Detail & Related papers (2026-01-27T11:54:14Z)
- RoboInspector: Unveiling the Unreliability of Policy Code for LLM-enabled Robotic Manipulation [7.650053106303868]
Large language models (LLMs) demonstrate remarkable capabilities in reasoning and code generation. Despite these advances, achieving reliable policy code generation remains a significant challenge due to the diverse requirements involved. We introduce RoboInspector, a pipeline to unveil and characterize the unreliability of policy code for LLM-enabled robotic manipulation.
arXiv Detail & Related papers (2025-08-29T07:47:17Z)
- In-Context Learning Enables Robot Action Prediction in LLMs [52.285739178561705]
We introduce RoboPrompt, a framework that enables off-the-shelf text-only Large Language Models to directly predict robot actions. RoboPrompt shows stronger performance than zero-shot and ICL baselines in simulated and real-world settings.
arXiv Detail & Related papers (2024-10-16T17:56:49Z)
- Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion [41.52811286996212]
Make-An-Agent is a novel policy parameter generator for behavior-to-policy generation. We show how it can generate a control policy for an agent using just one demonstration of desired behaviors as a prompt. We also deploy policies generated by Make-An-Agent onto real-world robots on locomotion tasks.
arXiv Detail & Related papers (2024-07-15T17:59:57Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts.
We leverage natural language prompts and contextual information from the Robot Operating System (ROS).
Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z)
- Executable Code Actions Elicit Better LLM Agents [76.95566120678787]
This work proposes to use Python code to consolidate Large Language Model (LLM) agents' actions into a unified action space (CodeAct).
Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions.
The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language.
arXiv Detail & Related papers (2024-02-01T21:38:58Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions
with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)