Related papers: Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

URL: http://arxiv.org/abs/2602.21198v1
Date: Tue, 24 Feb 2026 18:55:18 GMT
Title: Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi,
Abstract summary: Embodied robots cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials.<n>We introduce Reflective Test-Time Planning, which integrates two modes of reflection: textitreflection-in-action and textitreflection-on-action<n>We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight.
Score: 63.88783817420284
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

Related papers

PreFlect: From Retrospective to Prospective Reflection in Large Language Model Agents [30.225072803272273]
We introduce PreFlect, a prospective reflection mechanism that shifts the paradigm from post hoc correction to pre-execution foresight.<n>We distill planning errors from historical agent trajectories, capturing recurring success and failure patterns observed across past executions.
arXiv Detail & Related papers (2026-02-06T20:42:44Z)
Teaching Large Reasoning Models Effective Reflection [62.73646680747003]
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks.<n>However, not all reflections are beneficial-many are superficial, offering little to no improvement over the original answer.<n>We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques.
arXiv Detail & Related papers (2026-01-19T04:51:53Z)
First Try Matters: Revisiting the Role of Reflection in Reasoning Models [66.39546876232512]
We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output.<n>Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer.<n>We propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated.
arXiv Detail & Related papers (2025-10-09T14:57:10Z)
SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection [14.40651157974557]
SAMULE is a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis.<n>It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error across multiple trials of the same task, and Inter-Task Learning (macro-level) to extract transferable insights based on same typed errors from diverse task failures.
arXiv Detail & Related papers (2025-09-24T21:02:15Z)
Unveiling the Latent Directions of Reflection in Large Language Models [3.396557052704669]
We investigate reflection through the lens of latent directions in model activations.<n>New reflection-inducing instructions can be systematically identified, and reflective behavior can be directly enhanced or suppressed.<n>This work opens a path toward mechanistic understanding of reflective reasoning in large language models.
arXiv Detail & Related papers (2025-08-23T11:05:15Z)
Perception in Reflection [39.33505560810175]
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models.<n>We propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models.
arXiv Detail & Related papers (2025-04-09T17:59:02Z)
Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction [11.838351314880736]
Instruct-of-Reflection (IoRT) is a novel and general reflection framework that leverages dynamic-meta instruction to enhance the iterative reflection capability of Large Language Models (LLMs)<n>Our experiments demonstrate that IoRT achieves an average improvement of 10.1% over established baselines in mathematical and commonsense reasoning tasks.
arXiv Detail & Related papers (2025-03-02T14:02:03Z)
Meta-Reflection: A Feedback-Free Reflection Learning Framework [57.14485943991588]
We propose Meta-Reflection, a feedback-free reflection mechanism that requires only a single inference pass without external feedback.<n>Motivated by the human ability to remember and retrieve reflections from past experiences, Meta-Reflection integrates reflective insights into a codebook.<n>To thoroughly investigate and evaluate the practicality of Meta-Reflection in real-world scenarios, we introduce an industrial e-commerce benchmark named E-commerce Customer Intent Detection.
arXiv Detail & Related papers (2024-12-18T12:20:04Z)
Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself.<n>We present Reflection-Reinforced Self-Training (Re-ReST), which uses a textitreflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [45.87069217634753]
Research indicates without external feedback, Large Language Model's intrinsic reflection is unstable. Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. We advocate Self-Contrast: It adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist which could be used to re-examine and eliminate discrepancies.
arXiv Detail & Related papers (2024-01-04T00:32:33Z)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.