Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
- URL: http://arxiv.org/abs/2512.20092v1
- Date: Tue, 23 Dec 2025 06:37:29 GMT
- Title: Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
- Authors: Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong,
- Abstract summary: We introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state of the art for open-source models.
- Score: 80.33280979339123
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing work and our pilot study show that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy: the dialogue history is first pruned into a candidate set using temporal and relevance filters, and an RL agent then selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query's time scope at both the session level (chronological proximity) and the utterance level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state of the art for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% performance gain. Moreover, Memory-T1 remains robust up to 128k tokens, where baseline models collapse, demonstrating its effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
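For concreteness, here is a minimal Python sketch of the pipeline the abstract describes: a coarse pruning stage over the dialogue history, followed by the multi-level reward that would score an RL agent's fine-grained session selection. All names (Session, prune_candidates, reward) and the reward weights are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of Memory-T1's coarse-to-fine selection and multi-level
# reward, reconstructed from the abstract alone. Names and weights are assumed.
from dataclasses import dataclass

@dataclass
class Session:
    session_id: int
    timestamp: float    # e.g., Unix time of the session
    text: str
    relevance: float    # precomputed query-relevance score in [0, 1]

def prune_candidates(history, query_start, query_end, top_k=20):
    """Coarse stage: keep sessions inside the query's time scope,
    then rank the survivors by relevance."""
    in_scope = [s for s in history if query_start <= s.timestamp <= query_end]
    return sorted(in_scope, key=lambda s: s.relevance, reverse=True)[:top_k]

def reward(answer_correct, selected_ids, gold_ids,
           session_align, utterance_align,
           w_acc=1.0, w_ground=0.5, w_time=0.5):
    """Multi-level reward: (i) answer accuracy, (ii) evidence grounding
    (F1 of selected vs. gold evidence sessions), and (iii) temporal
    consistency mixing session-level proximity and utterance-level
    chronological fidelity (both assumed to be scores in [0, 1])."""
    acc = 1.0 if answer_correct else 0.0
    tp = len(set(selected_ids) & set(gold_ids))
    prec = tp / len(selected_ids) if selected_ids else 0.0
    rec = tp / len(gold_ids) if gold_ids else 0.0
    grounding = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    temporal = 0.5 * session_align + 0.5 * utterance_align
    return w_acc * acc + w_ground * grounding + w_time * temporal
```

The dense temporal term is what distinguishes this reward from a plain accuracy signal: even a wrong answer earns partial credit for selecting chronologically well-aligned evidence, which gives the policy a denser learning signal.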
Related papers
- Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents [68.84161689205779]
Temporal Semantic Memory (TSM) is a memory framework that models semantic time for point-wise memory. TSM consistently outperforms existing methods and achieves up to a 12.2% absolute improvement in accuracy.
arXiv Detail & Related papers (2026-01-12T12:24:44Z)
- Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs [36.91809943381492]
Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We propose Rhea, a novel framework that decouples conversation history into two functionally independent memory modules. Experiments show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale.
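The summary only says that history is split into two functionally independent modules; the sketch below illustrates one plausible reading of that idea (a role/instruction store kept separate from task history), with every name and the recency window invented for illustration.

```python
# Hypothetical sketch of a two-module memory in the spirit of Rhea's summary.
# The task/role split and the 10-turn recency window are assumptions.
class DecoupledMemory:
    def __init__(self):
        self.role_memory = []   # persona and instruction constraints
        self.task_memory = []   # facts and decisions from the dialogue

    def write(self, turn: str, is_role_constraint: bool) -> None:
        target = self.role_memory if is_role_constraint else self.task_memory
        target.append(turn)

    def build_context(self, query: str) -> str:
        # Consulting each module independently keeps role constraints from
        # being crowded out of the prompt by a long task history.
        roles = "\n".join(self.role_memory)
        tasks = "\n".join(self.task_memory[-10:])
        return f"[ROLES]\n{roles}\n[HISTORY]\n{tasks}\n[QUERY]\n{query}"
```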
arXiv Detail & Related papers (2025-12-07T14:50:03Z)
- Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI [1.0742675209112622]
Large language models show promise for embodying historical characters in dialogue systems, but existing approaches face a critical trade-off. We present an architecture that resolves this tension through offline data augmentation and efficient parallel retrieval from structured episodic memory. Our system transforms biographical data into 1,774 enriched first-person memories with affective-semantic metadata, then employs two-stage retrieval achieving 0.52s prompt generation.
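A minimal sketch of what two-stage retrieval over metadata-enriched memories could look like; the tag filter, the embedding re-ranking, and all field names are assumptions, since the summary states only that retrieval happens in two stages over enriched memories.

```python
# Hypothetical two-stage retrieval: cheap metadata filter, then dense re-rank.
# Each memory is assumed to be:
#   {"text": str, "tags": set[str], "embedding": list[float]}
def retrieve(memories, query_vec, query_tags, coarse_k=50, fine_k=5):
    # Stage 1: keep at most coarse_k memories by affective/semantic tag overlap.
    coarse = sorted(memories,
                    key=lambda m: len(m["tags"] & query_tags),
                    reverse=True)[:coarse_k]
    # Stage 2: re-rank the survivors by embedding similarity (dot product).
    def sim(m):
        return sum(a * b for a, b in zip(m["embedding"], query_vec))
    return sorted(coarse, key=sim, reverse=True)[:fine_k]
```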
arXiv Detail & Related papers (2025-11-01T02:26:16Z)
- D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree [22.420810089099614]
Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues. We propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency. We introduce new NLI-based metrics to better measure multi-turn dialogue consistency.
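To make the evaluation idea concrete, here is a toy NLI-based consistency check. The pairwise contradiction-rate aggregation and the stub classifier are assumptions; the paper defines its own metrics.

```python
# Hypothetical NLI-based consistency metric. nli_label is a stand-in for any
# NLI classifier (e.g., a RoBERTa-MNLI model); plug in a real one to use this.
def nli_label(premise: str, hypothesis: str) -> str:
    """Stub: should return 'ENTAILMENT', 'NEUTRAL', or 'CONTRADICTION'."""
    return "NEUTRAL"

def contradiction_rate(agent_turns):
    """Fraction of ordered pairs of agent turns where the later turn
    contradicts the earlier one, according to the NLI model."""
    pairs = [(a, b) for i, a in enumerate(agent_turns)
             for b in agent_turns[i + 1:]]
    if not pairs:
        return 0.0
    hits = sum(nli_label(a, b) == "CONTRADICTION" for a, b in pairs)
    return hits / len(pairs)
```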
arXiv Detail & Related papers (2025-10-15T09:53:11Z)
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
- From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents [26.437011114518917]
The TimelyChat benchmark evaluates the ability of language models to predict appropriate time intervals and generate time-conditioned responses. We construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals.
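The two-step behavior attributed to Timer (predict an interval, then condition the reply on it) might look like the following sketch; the prompt tokens, the StubLM class, and the generate() interface are invented placeholders rather than the paper's API.

```python
# Hypothetical interval-then-respond loop in the spirit of Timer's summary.
class StubLM:
    """Stand-in for any instruction-tuned LM; replace with a real client."""
    def generate(self, prompt: str) -> str:
        return "2 hours" if "[PREDICT_INTERVAL]" in prompt else "Welcome back!"

def timely_respond(model, history: str):
    # Step 1: predict how much time should elapse before responding.
    interval = model.generate(f"{history}\n[PREDICT_INTERVAL]")
    # Step 2: generate a response conditioned on that elapsed interval.
    reply = model.generate(f"{history}\n[AFTER {interval}]\n[RESPOND]")
    return interval, reply

print(timely_respond(StubLM(), "A: I'm boarding my flight now."))
```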
arXiv Detail & Related papers (2025-06-17T07:56:32Z)
- In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents [70.12342024019044]
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information limits their effectiveness. We propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections. RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
arXiv Detail & Related papers (2025-03-11T04:15:52Z)
- On Memory Construction and Retrieval for Personalized Conversational Agents [69.46887405020186]
We propose SeCom, a method that constructs the memory bank at the segment level by introducing a conversation segmentation model. Experimental results show that SeCom exhibits a significant performance advantage over baselines on the long-term conversation benchmarks LOCOMO and Long-MT-Bench+.
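Segment-level memory construction could be sketched as follows; the similarity-drop boundary heuristic stands in for SeCom's learned segmentation model, and all names are assumptions.

```python
# Hypothetical segment-level memory bank. A real system would replace the
# similarity-drop heuristic with SeCom's trained segmentation model.
def segment(turns, sim, threshold=0.35):
    """Split turns into segments wherever adjacent-turn similarity drops."""
    if not turns:
        return []
    segments, current = [], [turns[0]]
    for prev, turn in zip(turns, turns[1:]):
        if sim(prev, turn) < threshold:   # topic shift -> start a new segment
            segments.append(current)
            current = []
        current.append(turn)
    segments.append(current)
    return segments

def build_memory_bank(turns, sim, embed):
    # One retrievable unit per segment, keyed by the segment's embedding,
    # instead of one unit per turn or one per whole session.
    return [{"text": " ".join(seg), "key": embed(" ".join(seg))}
            for seg in segment(turns, sim)]
```

Storing whole segments rather than single turns is the point of the summary: retrieval then returns coherent topical units instead of isolated utterances.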
arXiv Detail & Related papers (2025-02-08T14:28:36Z)
- TIMEDIAL: Temporal Commonsense Reasoning in Dialog [43.24596551545824]
We present the first study to investigate pre-trained language models for their temporal reasoning capabilities in dialogs.
We formulate TIMEDIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs.
Empirical results demonstrate that even the best performing models struggle on this task compared to humans.
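The multiple-choice cloze format can be illustrated with a toy instance; the example dialog, options, and scoring stub below are invented for illustration, not drawn from the dataset.

```python
# Toy illustration of a TIMEDIAL-style cloze instance: a masked temporal span
# plus candidate fillers, scored by whichever plausibility function you supply.
example = {
    "dialog": "A: How long was the flight? B: About <MASK>, with a stopover.",
    "options": ["15 minutes", "14 hours", "3 seconds", "2 centuries"],
}

def pick(example, plausibility):
    """plausibility(text) -> float; stand-in for an LM scoring function."""
    filled = [example["dialog"].replace("<MASK>", o)
              for o in example["options"]]
    scores = [plausibility(t) for t in filled]
    return example["options"][scores.index(max(scores))]
```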
arXiv Detail & Related papers (2021-06-08T17:59:21Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)