Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
- URL: http://arxiv.org/abs/2602.15889v1
- Date: Fri, 06 Feb 2026 13:41:07 GMT
- Title: Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
- Authors: Paul Tschisgale, Peter Wulff
- Abstract summary: Large language models (LLMs) are increasingly used in research. Much of this work implicitly assumes that LLM performance under fixed conditions is time-invariant. We conducted a longitudinal study on the temporal variability of GPT-4o's average performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o's average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interaction of a daily and a weekly rhythm. These findings indicate that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance. Implications for ensuring validity and replicability of research that uses or investigates LLMs are discussed.
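The analysis pipeline the abstract describes (scores averaged over ten responses every three hours, then examined with a Fourier periodogram for daily and weekly peaks) can be illustrated with a short sketch. This is not the authors' code: the synthetic series, amplitudes, and variable names are assumptions, chosen only to show how the power share of 24-hour and 168-hour components might be estimated.

```python
# Minimal sketch (not the authors' code) of the spectral analysis described
# in the abstract: a score series sampled every 3 hours is decomposed with a
# periodogram, and the power at the daily (24 h) and weekly (168 h)
# frequencies is read off. The synthetic data and amplitudes are assumptions.
import numpy as np
from scipy.signal import periodogram

SAMPLE_HOURS = 3                       # one query batch every 3 hours
N_DAYS = 90                            # roughly three months
n = N_DAYS * 24 // SAMPLE_HOURS        # 720 time points
t = np.arange(n) * SAMPLE_HOURS        # time axis in hours

# Stand-in for the averaged scores (10 responses per time point):
# baseline + daily and weekly rhythms + noise.
rng = np.random.default_rng(0)
scores = (0.7
          + 0.05 * np.sin(2 * np.pi * t / 24)    # daily rhythm
          + 0.04 * np.sin(2 * np.pi * t / 168)   # weekly rhythm
          + 0.08 * rng.standard_normal(n))

# Periodogram in cycles per hour (fs = samples per hour).
freqs, power = periodogram(scores - scores.mean(), fs=1 / SAMPLE_HOURS)

def power_fraction(target_freq):
    """Share of total spectral power in the bin nearest target_freq."""
    k = np.argmin(np.abs(freqs - target_freq))
    return power[k] / power.sum()

print(f"daily  (24 h):  {power_fraction(1 / 24):.1%} of spectral power")
print(f"weekly (168 h): {power_fraction(1 / 168):.1%} of spectral power")
```

In the paper's data the two rhythms together account for roughly 20% of the total variance; in this sketch the corresponding shares depend entirely on the synthetic amplitudes chosen above.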
Related papers
- PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting [30.347634829157766]
We propose PHAT (Period Heterogeneity-Aware Transformer) for modeling periodicity in real-world data. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We evaluate PHAT on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods.
arXiv Detail & Related papers (2026-01-31T10:58:09Z) - TSAQA: Time Series Analysis Question And Answering Benchmark [85.35545785252309]
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities.
arXiv Detail & Related papers (2026-01-30T17:28:56Z) - Not in Sync: Unveiling Temporal Bias in Audio Chat Models [59.146710538620816]
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction.
arXiv Detail & Related papers (2025-10-14T06:29:40Z) - Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback [55.284574165467525]
Time-series Reasoning for Anomaly (Time-RA) transforms classical time series anomaly detection into a generative, reasoning-intensive task. We also introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning.
arXiv Detail & Related papers (2025-07-20T18:02:50Z) - Breaking Silos: Adaptive Model Fusion Unlocks Better Time Series Forecasting [64.45587649141842]
Time-series forecasting plays a critical role in many real-world applications. No single model consistently outperforms others across different test samples; instead, each model excels in specific cases. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models.
arXiv Detail & Related papers (2025-05-24T00:45:07Z) - General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data [61.163542597764796]
We show that time series with different time granularities (or corresponding frequency resolutions) exhibit distinct joint distributions in the frequency domain. A novel Fourier knowledge attention mechanism is proposed to enable learning time-aware representations from both the temporal and frequency domains. An autoregressive blank-infilling pre-training framework is incorporated into time series analysis for the first time, leading to a generative, task-agnostic pre-training strategy.
arXiv Detail & Related papers (2025-02-05T15:20:04Z) - Are Large Language Models Useful for Time Series Data Analysis? [3.44393516559102]
Time series data plays a critical role across diverse domains such as healthcare, energy, and finance. This study investigates whether large language models (LLMs) are effective for time series data analysis.
arXiv Detail & Related papers (2024-12-16T02:47:44Z) - Unveiling Divergent Inductive Biases of LLMs on Temporal Data [4.561800294155325]
This research focuses on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data.
Biases toward specific temporal relationships come to light: GPT-3.5 demonstrates a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE".
arXiv Detail & Related papers (2024-04-01T19:56:41Z) - TimeDRL: Disentangled Representation Learning for Multivariate Time-Series [10.99576829280084]
TimeDRL is a generic time-series representation learning framework with disentangled dual-level embeddings.
TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of 58.02% in MSE and a 1.48% gain in classification accuracy.
arXiv Detail & Related papers (2023-12-07T08:56:44Z) - Compatible Transformer for Irregularly Sampled Multivariate Time Series [75.79309862085303]
We propose a transformer-based encoder to achieve comprehensive temporal-interaction feature learning for each individual sample.
We conduct extensive experiments on 3 real-world datasets and validate that the proposed CoFormer significantly and consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-17T06:29:09Z)