A Benchmark for Procedural Memory Retrieval in Language Agents
- URL: http://arxiv.org/abs/2511.21730v1
- Date: Fri, 21 Nov 2025 08:08:53 GMT
- Title: A Benchmark for Procedural Memory Retrieval in Language Agents
- Authors: Ishant Kohar, Aswanth Krishnan
- Abstract summary: Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies. We present the first benchmark that isolates procedural memory retrieval from task execution. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones.
- Score: 0.023227405857540805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies -- a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM-generated procedural abstractions demonstrate reliable cross-context transfer. Controlled ablations show that although embeddings capture some lexical-level abstraction, they fundamentally treat procedures as unordered bags of words, discarding temporal structure necessary for cross-context transfer. Corpus scale delivers far larger gains than representation enrichment, revealing an architectural ceiling in current encoders. Our benchmark offers the first diagnostic framework separating genuine procedural understanding from surface-level memorization and gives tools for developing retrieval systems capable of dependable generalization. Resources available at our GitHub repository (https://github.com/qpiai/Proced_mem_bench).
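The abstract's claim that embeddings treat procedures as unordered bags of words can be illustrated with a toy sketch. This is not the paper's method or its encoders: the two step lists, the helper names `bag_of_words`, `cosine`, and `positional_match_rate`, and the use of raw token counts in place of a learned embedding are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(steps):
    """Count tokens over all steps; the order of the steps is discarded."""
    return Counter(tok for step in steps for tok in step.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def positional_match_rate(steps_a, steps_b):
    """Fraction of positions at which both procedures have the same step."""
    pairs = list(zip(steps_a, steps_b))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0


# Hypothetical procedures: the same steps in a different temporal order.
heat_then_place = ["take mug", "heat mug", "put mug on table"]
place_then_heat = ["put mug on table", "take mug", "heat mug"]

bow_similarity = cosine(bag_of_words(heat_then_place), bag_of_words(place_then_heat))
positional_match = positional_match_rate(heat_then_place, place_then_heat)

print(f"bag-of-words cosine:   {bow_similarity:.3f}")
print(f"positional match rate: {positional_match:.3f}")
```

Under these assumptions the order-blind representation cannot tell the two procedures apart, while even a crude order-aware comparison separates them. Real sentence encoders are not literal token counters, so this only sketches the failure mode the ablations describe.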
Related papers
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory [89.65731902036669]
Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents. We evaluate over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
arXiv Detail & Related papers (2025-11-25T21:08:07Z)
- LLM-guided Hierarchical Retrieval [54.73080745446999]
LATTICE is a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy. Our framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark.
arXiv Detail & Related papers (2025-10-15T07:05:17Z)
- CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension [55.29309306566238]
Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. We draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation.
arXiv Detail & Related papers (2025-10-07T02:16:30Z)
- H$^2$R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents [3.9054156855794973]
Large language model (LLM)-based agents have shown strong potential in multi-task scenarios. Existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. We propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer.
arXiv Detail & Related papers (2025-09-16T08:30:08Z)
- Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents [19.04968632268433]
We propose a hierarchical memory architecture for Large Language Model Agents (LLM Agents). Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations.
arXiv Detail & Related papers (2025-07-23T12:45:44Z)
- Tree-Based Text Retrieval via Hierarchical Clustering in RAGFrameworks: Application on Taiwanese Regulations [0.0]
We propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k. Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content. Our framework is simple to implement and easily integrates with existing RAG pipelines, making it a practical solution for real-world applications under limited resources.
arXiv Detail & Related papers (2025-06-16T15:34:29Z)
- ELITE: Embedding-Less retrieval with Iterative Text Exploration [5.8851517822935335]
Large Language Models (LLMs) have achieved impressive progress in natural language processing. Their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks.
arXiv Detail & Related papers (2025-05-17T08:48:43Z)
- Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs [23.960451986662996]
This paper proposes a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought reasoning. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text.
arXiv Detail & Related papers (2025-02-18T02:49:40Z)
- MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation [60.04380907045708]
Retrieval-Augmented Generation (RAG) is considered a promising strategy to address this problem. We propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval. MemoRAG achieves superior performances across a variety of long-context evaluation tasks.
arXiv Detail & Related papers (2024-09-09T13:20:31Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers. Previous work has explored ways to partition the search space into hierarchical structures. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.