Are Large Language Models Temporally Grounded?
- URL: http://arxiv.org/abs/2311.08398v2
- Date: Thu, 16 Nov 2023 09:41:28 GMT
- Title: Are Large Language Models Temporally Grounded?
- Authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti,
Shay B. Cohen
- Abstract summary: We provide large language models (LLMs) with textual narratives
and probe their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
- Score: 38.481606493496514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are large language models (LLMs) temporally grounded? Since LLMs cannot
perceive and interact with the environment, it is impossible to answer this
question directly. Instead, we provide LLMs with textual narratives and probe
them with respect to their common-sense knowledge of the structure and duration
of events, their ability to order events along a timeline, and self-consistency
within their temporal model (e.g., temporal relations such as after and before
are mutually exclusive for any pair of events). We evaluate state-of-the-art
LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities.
Generally, we find that LLMs lag significantly behind both human performance
and small-scale, specialised LMs. In-context learning, instruction tuning,
and chain-of-thought prompting reduce this gap only to a limited degree.
Crucially, LLMs struggle the most with self-consistency, displaying incoherent
behaviour in at least 27.23% of their predictions. Contrary to expectations, we
also find that scaling the model size does not guarantee positive gains in
performance. To explain these results, we study the sources from which LLMs may
gather temporal information: we find that sentence ordering in unlabelled
texts, available during pre-training, is only weakly correlated with event
ordering. Moreover, public instruction tuning mixtures contain few temporal
tasks. Hence, we conclude that current LLMs lack a consistent temporal model of
textual narratives. Code, datasets, and LLM outputs are available at
https://github.com/yfqiu-nlp/temporal-llms.
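The self-consistency criterion in the abstract (temporal relations such as "after" and "before" are mutually exclusive for any pair of events) can be made concrete with a minimal sketch. This is a hypothetical illustration, not the authors' released code; the `is_consistent` helper and its input format are assumptions for the example.

```python
# Hypothetical sketch of a temporal self-consistency check: a model's
# predictions over event pairs are coherent only if the relation it
# assigns to (a, b) is the inverse of the one it assigns to (b, a).

def is_consistent(predictions):
    """predictions maps an ordered event pair (a, b) to a relation in
    {"before", "after"}. Returns False if any pair and its reverse
    receive relations that are not mutual inverses."""
    inverse = {"before": "after", "after": "before"}
    for (a, b), rel in predictions.items():
        reverse = predictions.get((b, a))
        if reverse is not None and reverse != inverse[rel]:
            return False
    return True

# Coherent: "a before b" agrees with "b after a".
assert is_consistent({("a", "b"): "before", ("b", "a"): "after"})
# Incoherent: the model claims both "a before b" and "b before a".
assert not is_consistent({("a", "b"): "before", ("b", "a"): "before"})
```

Counting how often a model fails a check like this is one way to arrive at an incoherence rate such as the 27.23% reported in the abstract.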
Related papers
- LinkGPT: Teaching Large Language Models To Predict Missing Links [23.57145845001286]
Large Language Models (LLMs) have shown promising results on various language and vision tasks.
Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs)
arXiv Detail & Related papers (2024-06-07T04:54:36Z)
- "Sorry, Come Again?" Prompting -- Enhancing Comprehension and Diminishing Hallucination with [PAUSE]-injected Optimal Paraphrasing [10.20632187568563]
Hallucination has emerged as the most vulnerable aspect of contemporary Large Language Models (LLMs)
In this paper, we introduce the Sorry, Come Again (SCA) prompting, aimed to avoid LLM hallucinations.
We provide an in-depth analysis of linguistic nuances: formality, readability, and concreteness of prompts for 21 LLMs.
We propose an optimal paraphrasing technique to identify the most comprehensible paraphrase of a given prompt.
arXiv Detail & Related papers (2024-03-27T19:45:09Z)
- The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with large language models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias - the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
- Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [39.874834611685124]
Large language models (LLMs) have been applied in many fields with rapid development in recent years.
This paper shows that LLMs excel in predicting time series with clear patterns and trends but face challenges with datasets lacking periodicity.
In addition, the input strategy is investigated, and it is found that incorporating external knowledge and adopting natural language paraphrases positively affects the predictive performance of LLMs for time series.
arXiv Detail & Related papers (2024-02-16T17:15:28Z)
- LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning [67.39585115936329]
We argue that LLMs have inherent capabilities to handle long contexts without fine-tuning.
We propose SelfExtend to extend the context window of LLMs by constructing bi-level attention information.
We conduct comprehensive experiments on multiple benchmarks and the results show that our SelfExtend can effectively extend existing LLMs' context window length.
arXiv Detail & Related papers (2024-01-02T18:30:51Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks.
This paper constructs Multiple Sensitive Factors Time QA (MenatQA), with a total of 2,853 samples, for evaluating the time comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Event knowledge in large language models: the gap between the impossible and the unlikely [46.540380831486125]
We show that pre-trained large language models (LLMs) possess substantial event knowledge.
They almost always assign higher likelihood to possible vs. impossible events.
However, they show less consistent preferences for likely vs. unlikely events.
arXiv Detail & Related papers (2022-12-02T23:43:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.