FuguReport

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Authors: Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, Zhicheng Dou
Affiliations: Beijing Academy of Artificial Intelligence; Renmin University of China
Categories: Method / Memory Management / Compact memory cues for continuous interaction; Application / Reasoning Agent / Long-horizon agent inference; Evaluation / Agent Performance / Intent-driven recall with raw trajectory pages
License: CC BY 4.0

Abstract Overview

This paper frames long-horizon agent reasoning as a memory-access problem rather than only a context-length problem. It proposes State-Adaptive Memory (SAM), a standalone memory framework that converts ongoing interaction histories into compact memory cues while preserving raw trajectory pages outside the active context. During inference, the agent uses its current intent to select cues and trigger reconstruction of decision-relevant information from the stored pages. The memory module is trained separately from the reasoning backbone using expert-guided supervised fine-tuning and a reinforcement-learning procedure called OAT-GRPO. Evaluations on BrowseComp, BrowseComp-ZH, WideSearch, and HLE show consistent improvements over heuristic context-management baselines across both GLM-4.7 and Qwen3.5-35B-A3B backbones.
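
The cue-page idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `CuePageMemory`, its methods, and especially the word-overlap scoring, which stands in for the trained memory model that SAM uses to select cues and reconstruct information. It only shows the data layout: short cues stay in the active context, raw pages stay outside it, and the current intent decides which page is pulled back.

```python
from dataclasses import dataclass, field


@dataclass
class CuePageMemory:
    """Toy cue-page store (hypothetical, not the paper's implementation)."""

    pages: dict[int, str] = field(default_factory=dict)  # page_id -> raw trajectory text
    cues: dict[int, str] = field(default_factory=dict)   # page_id -> compact cue

    def write(self, raw_page: str, cue: str) -> int:
        """Store a raw page outside the context and register its compact cue."""
        page_id = len(self.pages)
        self.pages[page_id] = raw_page
        self.cues[page_id] = cue
        return page_id

    def active_context(self) -> list[str]:
        """Only the cues are placed in the agent's active context."""
        return [f"[{pid}] {cue}" for pid, cue in self.cues.items()]

    def recall(self, intent: str, top_k: int = 1) -> list[str]:
        """Return raw pages whose cues share words with the current intent.

        Word overlap is a placeholder for a learned, intent-conditioned selector.
        """
        intent_words = set(intent.lower().split())
        scored = [
            (len(intent_words & set(cue.lower().split())), pid)
            for pid, cue in self.cues.items()
        ]
        # Highest overlap first; ties go to the older page.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self.pages[pid] for score, pid in scored[:top_k] if score > 0]
```

In this toy version the context cost grows with the number of cues, not with the size of the raw pages, which is the property the summary attributes to SAM.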

Novelty

The distinctive contribution is a cue-page memory design that keeps compact cues in context but preserves raw trajectory pages for later, intent-conditioned recall, instead of treating summaries as full replacements for history. The work is also unusual in optimizing memory as a standalone module with expert traces plus a tree-structured RL objective that assigns credit at the memory-action level.
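
For readers unfamiliar with GRPO-style objectives, the sketch below shows only the generic group-relative advantage normalization that such methods build on. It is not OAT-GRPO: the tree structure and the memory-action-level credit assignment mentioned above are not reproduced, and the function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own sampling group.

    Generic GRPO-style normalization (assumed form): (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.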

Results

Across four long-horizon benchmarks, SAM is reported as the strongest context-management method on both tested backbones. For GLM-4.7, it reaches an average score of 57.0 versus 54.6 for the best heuristic baseline and 49.4 without context management; for Qwen3.5-35B-A3B, it reaches 48.8 versus 46.2 for the best heuristic baseline and 44.5 without context management. Ablations further indicate that both the supervised and RL training stages contribute, and that intent-driven episodic recall is a major source of the gains.
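
The margins implied by these averages can be worked out directly. The short script below uses only the six numbers quoted above; the dictionary layout and the `gains` helper are illustrative names, not part of the paper.

```python
# Average scores as reported in the summary above.
scores = {
    "GLM-4.7": {"SAM": 57.0, "best heuristic": 54.6, "no context management": 49.4},
    "Qwen3.5-35B-A3B": {"SAM": 48.8, "best heuristic": 46.2, "no context management": 44.5},
}


def gains(backbone_scores: dict[str, float]) -> dict[str, float]:
    """Absolute gain of SAM over each other setting, rounded to one decimal."""
    sam = backbone_scores["SAM"]
    return {
        name: round(sam - value, 1)
        for name, value in backbone_scores.items()
        if name != "SAM"
    }


for backbone, row in scores.items():
    print(backbone, gains(row))
```

On these figures SAM leads the best heuristic baseline by 2.4 points on GLM-4.7 and 2.6 points on Qwen3.5-35B-A3B, and leads the no-context-management setting by 7.6 and 4.3 points respectively.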

Key Points

  1. SAM reorganizes long interaction histories into compact memory cues plus externally stored raw pages, enabling intent-driven recall of detailed past information.
  2. The method is backbone-agnostic at deployment: a single Qwen3.5-9B memory model is shared across different agent backbones while only the memory module is trained.
  3. Experiments and ablations suggest that explicit, state-conditioned episodic recall outperforms heuristics such as truncation, recency windows, discard-tool strategies, and rolling summaries on long-horizon tasks.
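
To make the comparison in point 3 concrete, here is a rough sketch of three of the heuristic families it names. The message format, function names, and exact rules (character budget, window size, which tool outputs are dropped) are assumptions for illustration; the paper's baselines may be defined differently. What the sketch shows is that each heuristic permanently removes history from the agent's view, whereas the cue-page design keeps raw pages available for later recall.

```python
Message = dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def truncate(history: list[Message], max_chars: int) -> list[Message]:
    """Keep the newest messages whose combined content fits a character budget."""
    kept: list[Message] = []
    used = 0
    for msg in reversed(history):
        if used + len(msg["content"]) > max_chars:
            break
        kept.append(msg)
        used += len(msg["content"])
    return kept[::-1]  # restore chronological order


def recency_window(history: list[Message], k: int) -> list[Message]:
    """Keep only the last k messages."""
    return history[-k:] if k > 0 else []


def discard_tool(history: list[Message], keep_last: int = 1) -> list[Message]:
    """Drop tool outputs except the most recent `keep_last` of them."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    keep = set(tool_idx[-keep_last:]) if keep_last > 0 else set()
    return [m for i, m in enumerate(history) if m["role"] != "tool" or i in keep]
```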

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.