Test-Time Deep Thinking to Explore Implicit Rules
Abstract Overview
This paper studies LLM-based agents in environments with implicit rules, where hidden constraints must be inferred from interaction rather than read directly from observations. It proposes TTExplore, a test-time framework that pairs an actor, which executes actions, with a thinker, which periodically analyzes the trajectory, hypothesizes the latent rules behind failures, and issues revised guidance. Because the quality of intermediate reasoning is difficult to reward directly, the authors train a specialized 7B thinker, Exp-Thinker, with a reinforcement-learning pipeline that uses task-level improvement as an indirect reward and retains only a single thinking node per trajectory to reduce instability. The framework is evaluated on five text-based embodied tasks from AgentBoard, covering both in-domain and out-of-domain settings.
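The actor/thinker split described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`run_episode`, `toy_actor`, `toy_thinker`, `make_toy_env`), the fixed `think_every` interval, and the toy door-and-key environment are all hypothetical stand-ins for LLM calls and AgentBoard tasks.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def run_episode(
    actor: Callable[[str, str], str],
    thinker: Callable[[List[Step]], str],
    env_step: Callable[[str], Tuple[str, bool]],
    max_steps: int = 10,
    think_every: int = 3,  # hypothetical interval; the summary only says "periodically"
) -> List[Step]:
    """Actor acts at every step; thinker revises guidance every `think_every` steps."""
    trajectory: List[Step] = []
    guidance = ""  # empty until the thinker produces a hypothesis
    observation = "start"
    for t in range(1, max_steps + 1):
        action = actor(observation, guidance)
        observation, done = env_step(action)
        trajectory.append((action, observation))
        if done:
            break
        if t % think_every == 0:
            # Keep the previous guidance if the thinker has nothing new to say.
            guidance = thinker(trajectory) or guidance
    return trajectory


def make_toy_env() -> Callable[[str], Tuple[str, bool]]:
    """Toy environment with one implicit rule: the door opens only with the key."""
    state = {"has_key": False}

    def env_step(action: str) -> Tuple[str, bool]:
        if action == "pick up key":
            state["has_key"] = True
            return "you hold the key", False
        if action == "open door":
            if state["has_key"]:
                return "door opens", True
            return "nothing happens", False
        return "unknown action", False

    return env_step


def toy_actor(observation: str, guidance: str) -> str:
    """Stand-in for an LLM actor: follows guidance when it mentions the key."""
    if "key" in guidance and observation != "you hold the key":
        return "pick up key"
    return "open door"


def toy_thinker(trajectory: List[Step]) -> str:
    """Stand-in for an LLM thinker: hypothesizes a rule after repeated failures."""
    recent = trajectory[-3:]
    if recent and all(obs == "nothing happens" for _, obs in recent):
        return "Hypothesis: the door needs a key. Pick up the key first."
    return ""
```

In this toy run, `run_episode(toy_actor, toy_thinker, make_toy_env())` fails three times, receives the thinker's hypothesis, and then succeeds; without the thinker the actor would repeat the same failing action until `max_steps`.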
Novelty
The distinctive contribution is a dedicated test-time exploration architecture for discovering implicit environmental rules, rather than relying only on fixed prompting, offline knowledge, or stronger pretraining. The paper also introduces a tailored training pipeline for the thinker role that uses stable task-level rewards and single-node credit assignment to make reinforcement learning of deep reasoning more feasible.
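One way to read the training signal described above is sketched below. This is an interpretation under stated assumptions, not the paper's pipeline: it assumes the indirect reward is the difference between the task score obtained with and without the thinker's output, and that the single retained thinking node is chosen uniformly at random. The names `ThinkerSample` and `build_thinker_sample` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThinkerSample:
    node_step: int  # step index of the single retained thinking node
    reward: float   # task-level improvement credited to that node


def build_thinker_sample(
    thinking_steps: Sequence[int],
    score_with_thought: float,
    score_without_thought: float,
    rng: random.Random,
) -> ThinkerSample:
    """Keep one thinking node per trajectory and credit it with the score change.

    Assumptions (not confirmed by the source): uniform random node selection,
    and a reward equal to the raw difference of the two task-level scores.
    """
    if not thinking_steps:
        raise ValueError("trajectory contains no thinking nodes")
    node = rng.choice(list(thinking_steps))
    return ThinkerSample(node_step=node, reward=score_with_thought - score_without_thought)
```

Keeping a single node means the whole score change is attributed to one thinker output, which avoids having to split credit across several interleaved thoughts in the same trajectory.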
Results
Across five tasks, TTExplore with Exp-Thinker improves the average process score of baseline actors from 27.81 to 46.69 for LLaMA3-8B and from 40.87 to 54.32 for Qwen2.5-7B, gains of 18.88 and 13.45 points respectively, in line with the reported average gain of roughly 14-19 points. It also improves stronger trained actors, for example raising the BabyAI score of Qwen2.5-Actor from 50.62 to 60.25, and the analysis shows higher action and observation diversity with lower repetition. In efficiency comparisons, TTExplore is reported to be about 1.4x slower than ReAct but less costly than Reflexion or Best-of-N, while remaining compatible with Best-of-N.
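The gains implied by the scores quoted above can be checked with a few lines of arithmetic. The dictionary layout and the `gains` name are illustrative only; the numbers are the ones stated in this section.

```python
# (score before TTExplore, score with TTExplore + Exp-Thinker), as quoted above
scores = {
    "LLaMA3-8B (average)": (27.81, 46.69),
    "Qwen2.5-7B (average)": (40.87, 54.32),
    "Qwen2.5-Actor (BabyAI)": (50.62, 60.25),
}

# Absolute improvement in points, rounded to two decimals
gains = {name: round(after - before, 2) for name, (before, after) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The two baseline-actor gains come out to 18.88 and 13.45 points, and the BabyAI gain for the trained actor to 9.63 points.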
Key Points
- TTExplore separates action execution and strategic reasoning by pairing an actor with a periodically invoked thinker that infers implicit rules from recent interaction history.
- Exp-Thinker is trained with an SFT+RL pipeline that uses task-level score improvements as indirect rewards and keeps one thinking node per trajectory to stabilize credit assignment.
- Empirically, the method improves both base and trained agents on five text-based embodied tasks and is associated with more exploratory, less repetitive behavior.
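The "more exploratory, less repetitive" behavior in the last point could be quantified in several ways; the summary does not give the paper's exact metrics. The sketch below uses two simple proxies chosen for illustration: the ratio of distinct items to total items, and the fraction of steps that repeat the immediately preceding item. Both function names are hypothetical.

```python
from typing import Sequence


def distinct_ratio(items: Sequence[str]) -> float:
    """Share of unique entries among all actions (or observations) in a trajectory."""
    return len(set(items)) / len(items) if items else 0.0


def repetition_rate(items: Sequence[str]) -> float:
    """Fraction of steps whose entry is identical to the previous step's entry."""
    if len(items) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(items, items[1:]) if prev == cur)
    return repeats / (len(items) - 1)


# Toy action sequence: three failed attempts, then a new action, then a retry
actions = ["open door", "open door", "open door", "pick up key", "open door"]
print(distinct_ratio(actions), repetition_rate(actions))
```

Under these proxies, a trajectory that keeps retrying the same failing action scores low on `distinct_ratio` and high on `repetition_rate`, which is the pattern the thinker's guidance is meant to break.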