Related papers: On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

URL: http://arxiv.org/abs/2504.07646v1
Date: Thu, 10 Apr 2025 10:48:42 GMT
Title: On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data
Authors: Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo,
Abstract summary: The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored.<n>In this paper we work on this topic, focusing on structured and semi-structured anonymized data.<n>We identify and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
Score: 1.2979906794584584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored. In this paper we work on this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the \textit{Reasoning and Answering Temporal Ability} dataset (RATA), featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflexion and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.

Related papers

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. They struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z)
ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding.<n>We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently.<n>Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting. We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning [20.066249913943405]
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors. We introduce novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks.
arXiv Detail & Related papers (2024-06-13T14:31:19Z)
LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
StructGPT: A General Framework for Large Language Model to Reason over Structured Data [117.13986738340027]
We develop an emphIterative Reading-then-Reasoning(IRR) approach for solving question answering tasks based on structured data. Our approach can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines.
arXiv Detail & Related papers (2023-05-16T17:45:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.