Unveiling Divergent Inductive Biases of LLMs on Temporal Data
- URL: http://arxiv.org/abs/2404.01453v1
- Date: Mon, 1 Apr 2024 19:56:41 GMT
- Title: Unveiling Divergent Inductive Biases of LLMs on Temporal Data
- Authors: Sindhu Kishore, Hangfeng He
- Abstract summary: This research focuses on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data.
Biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE".
- Score: 4.561800294155325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE". Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards "TRUE", and GPT-4 exhibits a preference for "FALSE" in the TE format for both implicit and explicit events. This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity.
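The label preference described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `label_preference`, the skew measure (predicted share minus gold share per label), and the toy data are all assumptions made for illustration only.

```python
from collections import Counter


def label_preference(predictions, gold):
    """Return, for each label, its share among predictions minus its share in gold.

    A positive value means the model picks that label more often than the
    gold annotations would justify. This measure is an illustrative
    assumption, not the metric used in the paper.
    """
    if not predictions or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    n = len(predictions)
    pred_counts = Counter(predictions)
    gold_counts = Counter(gold)
    labels = sorted(set(pred_counts) | set(gold_counts))
    return {label: (pred_counts[label] - gold_counts[label]) / n for label in labels}


# Toy data (hypothetical, not taken from the paper): a model that over-predicts "AFTER".
gold = ["BEFORE", "AFTER", "BEFORE", "AFTER"]
predictions = ["AFTER", "AFTER", "BEFORE", "AFTER"]
skew = label_preference(predictions, gold)
print(skew)
```

The same function applies unchanged to the TE format by passing "TRUE" and "FALSE" labels instead of "BEFORE" and "AFTER".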
Related papers
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation.
We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented.
We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
- LM4OPT: Unveiling the Potential of Large Language Models in Formulating Mathematical Optimization Problems [0.0]
This study compares prominent Large Language Models, including GPT-3.5, GPT-4, and Llama-2-7b, in zero-shot and one-shot settings.
Our findings show GPT-4's superior performance, particularly in the one-shot scenario.
arXiv Detail & Related papers (2024-03-02T23:32:33Z)
- Behind the Screen: Investigating ChatGPT's Dark Personality Traits and Conspiracy Beliefs [0.0]
This paper analyzes the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4.
Dark personality traits and conspiracy beliefs were not particularly pronounced in either model.
arXiv Detail & Related papers (2024-02-06T16:03:57Z)
- A negation detection assessment of GPTs: analysis with the xNot360 dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z)
- Generic Temporal Reasoning with Differential Analysis and Explanation [61.96034987217583]
We introduce a novel task named TODAY that bridges the gap with temporal differential analysis.
TODAY evaluates whether systems can correctly understand the effect of incremental changes.
We show that TODAY's supervision style and explanation annotations can be used in joint learning.
arXiv Detail & Related papers (2022-12-20T17:40:03Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.