FuguReport

Test-Time Deep Thinking to Explore Implicit Rules

Authors Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun
Affiliations Shanghai Jiao Tong University / Tsinghua University / The Johns Hopkins University / Nankai University / Renmin University of China / University of Electronic Science and Technology of China
Categories Method / Exploration Techniques / Test-time deep thinking framework, Evaluation / Task Performance Evaluation / Improvement on embodied tasks, Application / Implicit Rule Discovery / Actor guidance via inferred rules
License CC BY 4.0

Abstract Overview

This paper studies LLM-based agents in environments with implicit rules, where hidden constraints must be inferred from interaction rather than read directly from observations. It proposes TTExplore, a test-time framework that pairs an actor, which executes actions, with a thinker, which periodically analyzes the trajectory, hypothesizes the latent rules behind failures, and gives the actor revised guidance. Because the quality of intermediate reasoning is difficult to reward directly, the authors train a specialized 7B thinker, Exp-Thinker, with a reinforcement-learning pipeline that uses task-level improvement as an indirect reward and retains only a single thinking node per trajectory to reduce instability. The framework is evaluated on five text-based embodied tasks from AgentBoard, covering both in-domain and out-of-domain settings.
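The actor/thinker split described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names (`run_episode`, `toy_actor`, `toy_thinker`), the `think_every` interval, and the toy environment whose hidden rule is that a door only opens once a key has been picked up. In the paper both roles are played by LLMs; here they are hand-written stubs so the control flow can be run as-is.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def make_toy_env() -> Callable[[str], Tuple[str, bool]]:
    """Hypothetical environment with an implicit rule: the door needs a key."""
    state = {"has_key": False}

    def env_step(action: str) -> Tuple[str, bool]:
        if action == "take key":
            state["has_key"] = True
            return "You pick up the key.", False
        if action == "open door" and state["has_key"]:
            return "The door opens.", True
        # The rule is never stated; failures only produce this message.
        return "Nothing happens.", False

    return env_step


def toy_actor(trajectory: List[Step], guidance: str) -> str:
    """Stub actor: follows the thinker's guidance if any, else retries the goal."""
    took_key = any(action == "take key" for action, _ in trajectory)
    if "take key" in guidance and not took_key:
        return "take key"
    return "open door"


def toy_thinker(trajectory: List[Step]) -> str:
    """Stub thinker: hypothesizes a hidden rule when recent steps all failed."""
    recent = trajectory[-4:]
    if all(obs == "Nothing happens." for _, obs in recent):
        return "Hypothesis: the door needs a key. Try 'take key' first."
    return ""


def run_episode(env_step, actor, thinker, max_steps: int = 12, think_every: int = 4):
    """Alternate acting with periodic thinking over the trajectory so far."""
    trajectory: List[Step] = []
    guidance = ""
    for t in range(1, max_steps + 1):
        action = actor(trajectory, guidance)
        observation, done = env_step(action)
        trajectory.append((action, observation))
        if done:
            break
        if t % think_every == 0:  # the thinker is invoked only periodically
            guidance = thinker(trajectory)
    return trajectory, guidance


trajectory, guidance = run_episode(make_toy_env(), toy_actor, toy_thinker)
for action, observation in trajectory:
    print(f"{action!r} -> {observation!r}")
```

In this toy run the actor repeats the failing action until the thinker is first invoked, after which the revised guidance changes its behavior. The interval and the length of the history window are free parameters of the sketch, not values taken from the paper.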

Novelty

The distinctive contribution is a dedicated test-time exploration architecture for discovering implicit environmental rules, rather than relying only on fixed prompting, offline knowledge, or stronger pretraining. The paper also introduces a tailored training pipeline for the thinker role that uses stable task-level rewards and single-node credit assignment to make reinforcement learning of deep reasoning more feasible.
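One possible reading of the task-level reward and single-node credit assignment is sketched below. The summary does not give the exact formulation, so the details are assumptions: the reward is taken as the difference between the task score with and without the thinker's guidance, and the single retained thinking node is chosen uniformly at random. The names `ThinkingNode`, `TrainingExample`, `indirect_reward`, and `build_example` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThinkingNode:
    step: int      # environment step at which the thinker was invoked
    thought: str   # the thinker's generated analysis and guidance


@dataclass
class TrainingExample:
    node: ThinkingNode
    reward: float


def indirect_reward(score_without_thinker: float, score_with_thinker: float) -> float:
    """Assumed reward: task-level score improvement attributed to thinking."""
    return score_with_thinker - score_without_thinker


def build_example(
    nodes: List[ThinkingNode],
    score_without_thinker: float,
    score_with_thinker: float,
    rng: random.Random,
) -> Optional[TrainingExample]:
    """Keep a single thinking node per trajectory and give it the whole reward.

    Choosing the node uniformly at random is an assumption of this sketch.
    """
    if not nodes:
        return None
    kept = rng.choice(nodes)
    return TrainingExample(kept, indirect_reward(score_without_thinker, score_with_thinker))


nodes = [ThinkingNode(4, "maybe the door needs a key"), ThinkingNode(8, "try the window")]
example = build_example(nodes, score_without_thinker=0.5, score_with_thinker=0.75,
                        rng=random.Random(0))
print(example)
```

Keeping one node per trajectory means the task-level signal is never split across several thinking steps, which is the stabilizing effect the paper attributes to this design.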

Results

Across five tasks, TTExplore with Exp-Thinker improves the average process score of baseline actors from 27.81 to 46.69 for LLaMA3-8B and from 40.87 to 54.32 for Qwen2.5-7B (gains of 18.88 and 13.45 points, respectively), in line with the reported average gain of roughly 14-19 points. It also improves stronger trained actors: for example, Qwen2.5-Actor rises from 50.62 to 60.25 on BabyAI. The analysis shows higher action and observation diversity with lower repetition. In efficiency comparisons, TTExplore is reported to be about 1.4x slower than ReAct but less costly than Reflexion or Best-of-N, while remaining compatible with Best-of-N.

Key Points

  1. TTExplore separates action execution and strategic reasoning by pairing an actor with a periodically invoked thinker that infers implicit rules from recent interaction history.
  2. Exp-Thinker is trained with an SFT+RL pipeline that uses task-level score improvements as indirect rewards and keeps one thinking node per trajectory to stabilize credit assignment.
  3. Empirically, the method improves both base and trained agents on five text-based embodied tasks and is associated with more exploratory, less repetitive behavior.
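The third point refers to behavior becoming more exploratory and less repetitive. The summary does not say how the paper measures this, so the two statistics below are stand-ins chosen for illustration: the share of distinct actions in a trajectory and the share of steps that repeat the previous action. The function names are hypothetical.

```python
from typing import Sequence


def distinct_ratio(items: Sequence[str]) -> float:
    """Fraction of unique entries; higher means more diverse behavior."""
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def repetition_rate(items: Sequence[str]) -> float:
    """Fraction of steps that repeat the immediately preceding entry."""
    if len(items) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(items, items[1:]) if prev == cur)
    return repeats / (len(items) - 1)


stuck = ["open door", "open door", "open door", "open door"]
exploring = ["look", "take key", "open door", "go north"]

for name, actions in [("stuck", stuck), ("exploring", exploring)]:
    print(name, distinct_ratio(actions), repetition_rate(actions))
```

The same functions can be applied to observation strings to get an observation-diversity figure under the same assumptions.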
