Are Large Language Models Temporally Grounded?
- URL: http://arxiv.org/abs/2311.08398v2
- Date: Thu, 16 Nov 2023 09:41:28 GMT
- Title: Are Large Language Models Temporally Grounded?
- Authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti,
Shay B. Cohen
- Abstract summary: We provide large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
- Score: 38.481606493496514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are large language models (LLMs) temporally grounded? Since LLMs cannot
perceive and interact with the environment, it is impossible to answer this
question directly. Instead, we provide LLMs with textual narratives and probe
them with respect to their common-sense knowledge of the structure and duration
of events, their ability to order events along a timeline, and self-consistency
within their temporal model (e.g., temporal relations such as after and before
are mutually exclusive for any pair of events). We evaluate state-of-the-art
LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities.
Generally, we find that LLMs lag significantly behind both human performance
and small-scale, specialised LMs. In-context learning, instruction tuning,
and chain-of-thought prompting reduce this gap only to a limited degree.
Crucially, LLMs struggle the most with self-consistency, displaying incoherent
behaviour in at least 27.23% of their predictions. Contrary to expectations, we
also find that scaling the model size does not guarantee positive gains in
performance. To explain these results, we study the sources from which LLMs may
gather temporal information: we find that sentence ordering in unlabelled
texts, available during pre-training, is only weakly correlated with event
ordering. Moreover, public instruction tuning mixtures contain few temporal
tasks. Hence, we conclude that current LLMs lack a consistent temporal model of
textual narratives. Code, datasets, and LLM outputs are available at
https://github.com/yfqiu-nlp/temporal-llms.
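To make the self-consistency criterion above concrete, the following is a minimal sketch of how such a check could look in practice. The prediction format, event names, and helper function are hypothetical illustrations, not the evaluation code released in the repository linked above. The check asks whether a model's answers for a pair of events remain coherent when the relation is queried in both directions (e.g., "before" in one direction must pair with "after" in the other).

```python
# Hedged sketch of a self-consistency check for temporal relations.
# Assumed (hypothetical) prediction format: a mapping from ordered event
# pairs to the relation the model predicted for that query direction.
from itertools import combinations

predictions = {
    ("boarded the train", "arrived at work"): "before",
    ("arrived at work", "boarded the train"): "before",   # contradicts the line above
    ("had lunch", "left the office"): "before",
    ("left the office", "had lunch"): "after",            # coherent with the line above
}

# Relations that must flip when the event pair is flipped.
INVERSE = {"before": "after", "after": "before"}

def consistency_report(preds):
    """Return (checked, violations) over event pairs queried in both directions."""
    checked, violations = 0, []
    events = {event for pair in preds for event in pair}
    for a, b in combinations(sorted(events), 2):
        forward, backward = preds.get((a, b)), preds.get((b, a))
        if forward is None or backward is None:
            continue  # the pair was not queried in both directions
        checked += 1
        if backward != INVERSE.get(forward, backward):
            violations.append((a, b, forward, backward))
    return checked, violations

checked, violations = consistency_report(predictions)
print(f"{len(violations)}/{checked} bidirectionally queried pairs are inconsistent")
for a, b, fwd, bwd in violations:
    print(f"  '{a}' / '{b}': predicted '{fwd}' one way and '{bwd}' the other")
```

Scaled over a full evaluation set, the fraction of such violations corresponds to the kind of incoherence rate reported in the abstract.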
Related papers
- Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification? [2.1861408994125253]
Large Language Models (LLMs) have recently shown promising performance in temporal reasoning tasks.
Recent studies have tested LLMs' performance in detecting temporal relations, but only with closed-source models.
arXiv Detail & Related papers (2024-10-14T13:10:45Z)
- Are LLMs Good Annotators for Discourse-level Event Relation Extraction? [15.365993658296016]
Large Language Models (LLMs) have demonstrated proficiency in a wide array of natural language processing tasks.
Our study reveals a notable underperformance of LLMs compared to the baseline established through supervised learning.
arXiv Detail & Related papers (2024-07-28T19:27:06Z)
- The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
- Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [46.02234423159257]
Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years.
Recent works treat large language models as zero-shot time series reasoners without further fine-tuning.
Our study shows that LLMs perform well in predicting time series with clear patterns and trends, but face challenges with datasets lacking periodicity.
arXiv Detail & Related papers (2024-02-16T17:15:28Z)
- LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning [67.39585115936329]
We argue that LLMs have inherent capabilities to handle long contexts without fine-tuning.
We propose SelfExtend to extend the context window of LLMs by constructing bi-level attention information.
We conduct comprehensive experiments on multiple benchmarks and the results show that our SelfExtend can effectively extend existing LLMs' context window length.
arXiv Detail & Related papers (2024-01-02T18:30:51Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks.
This paper constructs Multiple Sensitive Factors Time QA (MenatQA), with a total of 2,853 samples, for evaluating the time comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs under two constraints: being task-agnostic and minimizing reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Event knowledge in large language models: the gap between the impossible and the unlikely [46.540380831486125]
We show that pre-trained large language models (LLMs) possess substantial event knowledge.
They almost always assign higher likelihood to possible vs. impossible events.
However, they show less consistent preferences for likely vs. unlikely events.
arXiv Detail & Related papers (2022-12-02T23:43:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.