CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
- URL: http://arxiv.org/abs/2511.14937v1
- Date: Tue, 18 Nov 2025 21:51:23 GMT
- Title: CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
- Authors: Niloofar Mireshghallah, Neal Mangaokar, Narine Kokhlikyan, Arman Zharmagambetov, Manzil Zaheer, Saeed Mahloujifar, Kamalika Chaudhuri
- Abstract summary: Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context.
- Score: 62.116710797795314
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory introduces critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. CIMemories uses synthetic user profiles with over 100 attributes per user, paired with diverse task contexts in which each attribute may be essential for some tasks but inappropriate for others. Our evaluation reveals that frontier models exhibit up to 69% attribute-level violations (leaking information inappropriately), with lower violation rates often coming at the cost of task utility. Violations accumulate across both tasks and runs: as usage increases from 1 to 40 tasks, GPT-5's violations rise from 0.1% to 9.6%, reaching 25.1% when the same prompt is executed 5 times, revealing arbitrary and unstable behavior in which models leak different attributes for identical prompts. Privacy-conscious prompting does not solve this - models overgeneralize, sharing everything or nothing rather than making nuanced, context-dependent decisions. These findings reveal fundamental limitations that require contextually aware reasoning capabilities, not just better prompting or scaling.
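The attribute-level violation metric described in the abstract can be sketched as a simple check over task contexts. The profile attributes, task definitions, and substring-matching leak detector below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of CIMemories-style attribute-level violation scoring.
# The profile, tasks, responses, and substring-based leak check are all
# illustrative; the benchmark's real scoring pipeline may differ.

def violation_rate(profile, tasks, responses):
    """Fraction of context-inappropriate attributes leaked into responses.

    profile:   {attribute_name: attribute_value}
    tasks:     [{"prompt": str, "inappropriate": {attr, ...}}, ...]
    responses: one model output per task, aligned with `tasks`.
    """
    violations = 0
    total = 0
    for task, response in zip(tasks, responses):
        for attr in task["inappropriate"]:
            total += 1
            # Naive leak check: attribute value appears verbatim in the output.
            if profile[attr].lower() in response.lower():
                violations += 1
    return violations / total if total else 0.0

profile = {"employer": "Acme Corp", "health_condition": "asthma"}
tasks = [
    {"prompt": "Draft a cover letter.", "inappropriate": {"health_condition"}},
    {"prompt": "Book a doctor appointment.", "inappropriate": {"employer"}},
]
responses = [
    "As an Acme Corp engineer managing asthma, ...",  # leaks health_condition
    "Requesting an appointment for next week.",       # no leak
]
print(violation_rate(profile, tasks, responses))  # 0.5
```

In this toy setup, the same attribute (the user's health condition) would be essential for the appointment task but inappropriate for the cover letter, which is the compositional tension the benchmark measures.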
Related papers
- MemPO: Self-Memory Policy Optimization for Long-Horizon Agents [52.00646524941419]
Existing methods typically introduce an external memory module and look up relevant information from the stored memory. We propose the self-memory policy optimization algorithm (MemPO), which enables the agent to autonomously summarize and manage its memory. MemPO achieves absolute F1 score gains of 25.98% over the base model and 7.1% over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%, respectively.
arXiv Detail & Related papers (2026-02-28T14:43:02Z) - Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays [14.981027641902221]
We introduce a Scene-Aware Memory Discrimination method (SAMD) to address large-scale interactions and diverse memory standards. We show that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. When integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
arXiv Detail & Related papers (2026-02-12T05:53:54Z) - Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents [20.357475946040054]
We introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions. A reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent.
arXiv Detail & Related papers (2026-01-13T06:22:32Z) - Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z) - PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory [56.81126490418336]
Personalization is one of the next milestones in advancing AI capability and alignment. PersonaMem-v2 simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows. We train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization.
arXiv Detail & Related papers (2025-12-07T06:48:23Z) - MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning [73.27233666920618]
We propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. We introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents.
arXiv Detail & Related papers (2025-11-04T18:27:39Z) - Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs [28.807582003957005]
We present a framework for evaluating the abilities of large language models (LLMs) on tasks that require long-term memory and thus long-context reasoning. We then construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. To enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems.
arXiv Detail & Related papers (2025-10-31T07:29:52Z) - Operationalizing Data Minimization for Privacy-Preserving LLM Prompting [10.031739933859622]
Large language models (LLMs) in consumer applications have led to frequent exchanges of personal information. We present a framework to formally define and operationalize data minimization. We evaluate the framework on four datasets spanning open-ended conversations and knowledge-intensive tasks.
arXiv Detail & Related papers (2025-10-04T04:20:18Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z) - MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation [54.410825977390274]
Existing benchmarks to evaluate contextual privacy in LLM agents primarily assess single-turn, low-complexity tasks. We first present MAGPIE, a benchmark comprising 158 real-life high-stakes scenarios across 15 domains. We then evaluate current state-of-the-art LLMs on their understanding of contextually private data and their ability to collaborate without violating user privacy.
arXiv Detail & Related papers (2025-06-25T18:04:25Z) - Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory [0.5584627289325719]
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, but their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations.
arXiv Detail & Related papers (2025-04-28T01:46:35Z) - Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior in Large Language Models (LLMs), which we term Verbosity Compensation (VC), similar to the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with other model-generated responses.
arXiv Detail & Related papers (2024-11-12T15:15:20Z) - Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models [93.08860674071636]
We show how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster dangerous model behaviors. We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
arXiv Detail & Related papers (2024-06-12T18:33:11Z) - Memory Sharing for Large Language Model based Agents [43.53494041932615]
This paper introduces Memory Sharing (MS), a framework that integrates real-time memory filtering, storage, and retrieval to enhance the In-Context Learning process.
The experimental results demonstrate that the MS framework significantly improves the agents' performance in addressing open-ended questions.
arXiv Detail & Related papers (2024-04-15T17:57:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.