Related papers: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

URL: http://arxiv.org/abs/2506.21783v1
Date: Thu, 26 Jun 2025 21:40:58 GMT
Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models
Authors: Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand,
Abstract summary: Large Language Models (LLMs) are susceptible to hallucinations and errors on particularly temporal understanding tasks.<n>We propose the Time referenced List based Question Answering (TLQA) benchmark that requires structured answers in list format aligned with corresponding time periods.<n>We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings.
Score: 54.39278049092508
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors on particularly temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering or TLQA benchmark that requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark, requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in open-domain setup, providing clear future directions for research on TLQA. The benchmark and code at https://github.com/elixir-research-group/TLQA.

Related papers

Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models [38.12930048471948]
TDBench is a new benchmark that systematically constructs Time-Sensitive Question-Answering pairs.<n>Fine-grained evaluation metric called time accuracy assesses validity of time references in model explanations.<n> experiments on contemporary Large Language Models show how ours enables scalable and comprehensive TSQA evaluation.
arXiv Detail & Related papers (2025-08-04T04:27:06Z)
The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions.<n> Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers.<n>We explore multi-stage query-based framework for WikiData QA, proposing multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z)
TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions.<n>We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category.<n>We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering [24.046966640011124]
ComplexTempQA is a large-scale dataset consisting of over 100 million question-answer pairs. The dataset covers questions spanning over two decades and offers an unmatched breadth of topics.
arXiv Detail & Related papers (2024-06-07T12:01:59Z)
Self-Improvement Programming for Temporal Knowledge Graph Question Answering [31.33908040172437]
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer questions with temporal intent over Temporal Knowledge Graphs (TKGs) Existing end-to-end methods implicitly model the time constraints by learning time-aware embeddings of questions and candidate answers. We introduce a novel self-improvement Programming method for TKGQA (Prog-TQA)
arXiv Detail & Related papers (2024-04-02T08:14:27Z)
Multi-hop Question Answering under Temporal Knowledge Editing [9.356343796845662]
Multi-hop question answering (MQA) under knowledge editing (KE) has garnered significant attention in the era of large language models. Existing models for MQA under KE exhibit poor performance when dealing with questions containing explicit temporal contexts. We propose TEMPoral knowLEdge augmented Multi-hop Question Answering (TEMPLE-MQA) to address this limitation.
arXiv Detail & Related papers (2024-03-30T23:22:51Z)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
A Benchmark for Generalizable and Interpretable Temporal Question Answering over Knowledge Bases [67.33560134350427]
TempQA-WD is a benchmark dataset for temporal reasoning. It is based on Wikidata, which is the most frequently curated, openly available knowledge base.
arXiv Detail & Related papers (2022-01-15T08:49:09Z)
Temporal and Object Quantification Networks [95.64650820186706]
We present a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.
arXiv Detail & Related papers (2021-06-10T16:18:21Z)
KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT) All tasks in KILT are grounded in the same snapshot of Wikipedia. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.