TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
- URL: http://arxiv.org/abs/2601.05300v1
- Date: Thu, 08 Jan 2026 13:24:49 GMT
- Title: TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
- Authors: Susmit Das,
- Abstract summary: We introduce TIME, a framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 <time> tags, tick turns that represent silent gaps, and short <think> blocks that can appear anywhere in a reply. We evaluate with TIMEBench, a dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning oriented large language models often expose explicit "thinking" as long, turn-global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re-trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta-reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 <time> tags, tick turns that represent silent gaps, and short <think> blocks that can appear anywhere in a reply. A four-phase curriculum including a small, maximally diverse full-batch alignment step trains Qwen3 dense models to invoke brief, in-place reasoning bursts and keep user facing text compact. We evaluate with TIMEBench, a temporally grounded dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no-thinking modes while reducing reasoning tokens by about an order of magnitude. Our training data and code are available at https://github.com/The-Coherence-Initiative/TIME and TIMEBench is available at https://github.com/The-Coherence-Initiative/TIMEBench
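The abstract describes three dialogue augmentations: ISO 8601 <time> tags on turns, tick turns that stand in for silent gaps, and short in-place <think> blocks inside a reply. The sketch below illustrates how such a transcript might be rendered; the exact tag serialization, the gap threshold, and the tick wording are assumptions for illustration, not the paper's specification.

```python
from datetime import datetime, timezone

# Hypothetical gap threshold beyond which a silent-gap "tick" turn is inserted
# (the paper does not state a value here; 6 hours is an assumption).
TICK_THRESHOLD_S = 6 * 3600

def render_turn(role, text, ts):
    """Render one dialogue turn with an ISO 8601 <time> tag."""
    stamp = f"<time>{ts.isoformat()}</time> " if ts else ""
    return f"{role}: {stamp}{text}"

def render_dialogue(turns):
    """Render (role, text, datetime) turns, inserting a tick turn for long gaps.

    The tick format is illustrative only; it simply exposes elapsed time so the
    model can condition on it.
    """
    lines, prev_ts = [], None
    for role, text, ts in turns:
        if prev_ts and (ts - prev_ts).total_seconds() > TICK_THRESHOLD_S:
            gap_h = (ts - prev_ts).total_seconds() / 3600
            lines.append(f"tick: <time>{ts.isoformat()}</time> [{gap_h:.0f}h of silence]")
        lines.append(render_turn(role, text, ts))
        prev_ts = ts
    return "\n".join(lines)

turns = [
    ("user", "I'm catching a flight tomorrow at 9am.",
     datetime(2026, 1, 7, 20, 0, tzinfo=timezone.utc)),
    ("assistant", "<think>Flight is at 09:00 tomorrow; suggest leaving ~3h before.</think> "
     "Plan to leave for the airport around 6am.",
     datetime(2026, 1, 7, 20, 0, 5, tzinfo=timezone.utc)),
    ("user", "Just landed, thanks!",
     datetime(2026, 1, 8, 13, 30, tzinfo=timezone.utc)),
]
print(render_dialogue(turns))
```

In this sketch the tick turn carries no user text; it only signals elapsed time, leaving it to the model to decide whether a brief <think> burst is warranted when the conversation resumes.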
Related papers
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models [93.844257719952]
We introduce the Game-Time Benchmark framework to assess temporal capabilities. Our evaluation of diverse SLMs reveals a clear performance disparity. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI.
arXiv Detail & Related papers (2025-09-30T15:23:39Z) - KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z) - Thinking Before You Speak: A Proactive Test-time Scaling Approach [54.8205006555199]
We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS). We design a pipeline for automatically collecting and filtering in-context examples for the generation of insights. Experiments on challenging mathematical datasets verify the effectiveness of TBYS.
arXiv Detail & Related papers (2025-08-26T03:43:32Z) - STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models [131.90117151306993]
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. Current SLMs lack the ability to perform an internal, unspoken thinking process before responding. We propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks.
arXiv Detail & Related papers (2025-07-21T08:30:03Z) - From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents [26.437011114518917]
The TimelyChat benchmark evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses. We construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals.
arXiv Detail & Related papers (2025-06-17T07:56:32Z) - TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios [34.611056451149416]
We propose TIME, a benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. We conduct extensive experiments on reasoning models and non-reasoning models. We release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.
arXiv Detail & Related papers (2025-05-19T09:22:02Z) - Once Upon a $\textit{Time}$ in $\textit{Graph}$: Relative-Time Pretraining for Complex Temporal Reasoning [96.03608822291136]
We make use of the underlying nature of time, and suggest creating a graph structure based on the relative placements of events along the time axis.
Inspired by the graph view, we propose RemeMo, which explicitly connects all temporally-scoped facts by modeling the time relations between any two sentences.
Experimental results show that RemeMo outperforms the baseline T5 on multiple temporal question answering datasets.
arXiv Detail & Related papers (2023-10-23T08:49:00Z) - Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models [44.670550143705746]
We introduce a comprehensive probing dataset tempreason to evaluate the temporal reasoning capability of large language models.
Our dataset includes questions of three temporal reasoning levels.
We also propose a novel learning framework to improve the temporal reasoning capability of large language models.
arXiv Detail & Related papers (2023-06-15T08:44:41Z) - TIMEDIAL: Temporal Commonsense Reasoning in Dialog [43.24596551545824]
We present the first study to investigate pre-trained language models for their temporal reasoning capabilities in dialogs.
We formulate TIMEDIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs.
Empirical results demonstrate that even the best performing models struggle on this task compared to humans.
arXiv Detail & Related papers (2021-06-08T17:59:21Z)