Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
- URL: http://arxiv.org/abs/2602.21198v1
- Date: Tue, 24 Feb 2026 18:55:18 GMT
- Title: Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
- Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi,
- Abstract summary: Embodied robots cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials. We introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action and reflection-on-action. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight.
- Score: 63.88783817420284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
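The three reflection modes named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: a plain score table stands in for the internal reflection model, a fixed reward dictionary stands in for external reflections from the environment, and every name here (`ToyReflectiveAgent`, `run_trials`, the action labels, the learning rate, and the discount) is a hypothetical choice made only for illustration.

```python
import random
from typing import Dict, List, Tuple


class ToyReflectiveAgent:
    """Toy stand-in for a reflective planner (hypothetical, not the paper's code)."""

    def __init__(self, actions: List[str], num_candidates: int = 3,
                 lr: float = 0.5, seed: int = 0) -> None:
        self.actions = list(actions)
        self.num_candidates = num_candidates
        self.lr = lr
        self.rng = random.Random(seed)
        # Assumption: optimistic initial scores so untried actions get explored.
        self.scores: Dict[str, float] = {a: 1.0 for a in self.actions}
        self.history: List[Tuple[str, float]] = []

    def reflect_in_action(self) -> str:
        """Before acting: sample candidates, score them internally, keep the best."""
        k = min(self.num_candidates, len(self.actions))
        candidates = self.rng.sample(self.actions, k)
        return max(candidates, key=lambda a: self.scores[a])

    def reflect_on_action(self, action: str, reward: float) -> None:
        """After acting: move the internal score toward the external feedback."""
        self.scores[action] += self.lr * (reward - self.scores[action])
        self.history.append((action, reward))

    def retrospective_reflection(self, final_reward: float,
                                 discount: float = 0.5) -> None:
        """With hindsight: re-credit earlier actions using the final outcome,
        discounted by how many steps before the end each action was taken."""
        for steps_back, (action, _) in enumerate(reversed(self.history)):
            target = final_reward * (discount ** steps_back)
            self.scores[action] += self.lr * (target - self.scores[action])


def run_trials(num_trials: int = 10) -> Tuple[ToyReflectiveAgent, List[str]]:
    """Run repeated trials in a made-up environment with fixed rewards."""
    env_reward = {"push": -1.0, "pull": 0.0, "lift": 2.0}  # hypothetical values
    agent = ToyReflectiveAgent(list(env_reward), num_candidates=3)
    chosen = []
    for _ in range(num_trials):
        action = agent.reflect_in_action()
        agent.reflect_on_action(action, env_reward[action])
        chosen.append(action)
    return agent, chosen


if __name__ == "__main__":
    agent, chosen = run_trials()
    print(chosen)
    print(agent.scores)
```

In this toy setting, feedback from earlier trials changes later choices, so mistakes are not repeated indefinitely. A real system would replace the score table with learned models and the fixed rewards with feedback gathered after execution.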
Related papers
- PreFlect: From Retrospective to Prospective Reflection in Large Language Model Agents [30.225072803272273]
We introduce PreFlect, a prospective reflection mechanism that shifts the paradigm from post hoc correction to pre-execution foresight. We distill planning errors from historical agent trajectories, capturing recurring success and failure patterns observed across past executions.
arXiv Detail & Related papers (2026-02-06T20:42:44Z)
- Teaching Large Reasoning Models Effective Reflection [62.73646680747003]
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks. However, not all reflections are beneficial; many are superficial, offering little to no improvement over the original answer. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques.
arXiv Detail & Related papers (2026-01-19T04:51:53Z)
- First Try Matters: Revisiting the Role of Reflection in Reasoning Models [66.39546876232512]
We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer. We propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated.
arXiv Detail & Related papers (2025-10-09T14:57:10Z)
- SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection [14.40651157974557]
SAMULE is a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error across multiple trials of the same task; and Inter-Task Learning (macro-level) to extract transferable insights based on same typed errors from diverse task failures.
arXiv Detail & Related papers (2025-09-24T21:02:15Z)
- Unveiling the Latent Directions of Reflection in Large Language Models [3.396557052704669]
We investigate reflection through the lens of latent directions in model activations. New reflection-inducing instructions can be systematically identified, and reflective behavior can be directly enhanced or suppressed. This work opens a path toward mechanistic understanding of reflective reasoning in large language models.
arXiv Detail & Related papers (2025-08-23T11:05:15Z)
- Perception in Reflection [39.33505560810175]
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models. We propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models.
arXiv Detail & Related papers (2025-04-09T17:59:02Z)
- Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction [11.838351314880736]
Instruct-of-Reflection (IoRT) is a novel and general reflection framework that leverages dynamic-meta instruction to enhance the iterative reflection capability of Large Language Models (LLMs). Our experiments demonstrate that IoRT achieves an average improvement of 10.1% over established baselines in mathematical and commonsense reasoning tasks.
arXiv Detail & Related papers (2025-03-02T14:02:03Z)
- Meta-Reflection: A Feedback-Free Reflection Learning Framework [57.14485943991588]
We propose Meta-Reflection, a feedback-free reflection mechanism that requires only a single inference pass without external feedback. Motivated by the human ability to remember and retrieve reflections from past experiences, Meta-Reflection integrates reflective insights into a codebook. To thoroughly investigate and evaluate the practicality of Meta-Reflection in real-world scenarios, we introduce an industrial e-commerce benchmark named E-commerce Customer Intent Detection.
arXiv Detail & Related papers (2024-12-18T12:20:04Z)
- Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself. We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
- Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [45.87069217634753]
Research indicates that, without external feedback, a Large Language Model's intrinsic reflection is unstable. Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. We advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist that can be used to re-examine and eliminate discrepancies.
arXiv Detail & Related papers (2024-01-04T00:32:33Z)
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG). Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.