Related papers: Revisiting LLMs as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models

Revisiting LLMs as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models

URL: http://arxiv.org/abs/2506.00457v1
Date: Sat, 31 May 2025 08:24:01 GMT
Title: Revisiting LLMs as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models
Authors: Junwoo Park, Hyuck Lee, Dohyun Lee, Daehoon Gwak, Jaegul Choo,
Abstract summary: Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training.<n>Recent studies suggest that LLMs lack inherent effectiveness in forecasting.<n>Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise.
Score: 32.30528039193554
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training, fueling interest in their potential for time-series forecasting. While LLMs have shown potential in zero-shot forecasting through prompting alone, recent studies suggest that LLMs lack inherent effectiveness in forecasting. Given these conflicting findings, a rigorous validation is essential for drawing reliable conclusions. In this paper, we evaluate the effectiveness of LLMs as zero-shot forecasters compared to state-of-the-art domain-specific models. Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise, underperforming even simple domain-specific models. We have explored solutions to reduce LLMs' sensitivity to noise in the zero-shot setting, but improving their robustness remains a significant challenge. Our findings suggest that rather than emphasizing zero-shot forecasting, a more promising direction would be to focus on fine-tuning LLMs to better process numerical sequences. Our experimental code is available at https://github.com/junwoopark92/revisiting-LLMs-zeroshot-forecaster.

Related papers

Rethinking the Role of LLMs in Time Series Forecasting [15.951870420397682]
Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals.<n>We show that such conclusions stem from limited evaluation settings and do not hold at scale.<n>Our results demonstrate that emphLLM4TS indeed improves forecasting performance, with especially large gains in cross-domain generalization.
arXiv Detail & Related papers (2026-02-16T13:39:09Z)
Enhancing Zero-Shot Time Series Forecasting in Off-the-Shelf LLMs via Noise Injection [18.267727687739853]
Large Language Models (LLMs) have demonstrated effectiveness as zero-shot time series (TS) forecasters.<n>The key challenge lies in tokenizing TS data into textual representations that align with LLMs' pre-trained knowledge.<n>We introduce two novel TS datasets that fall outside all utilized LLMs' pre-training scopes, and consistently observe improved performance.
arXiv Detail & Related papers (2025-12-23T08:02:33Z)
The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification [74.64864354503204]
We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring.<n>We evaluate the ability of LLMs to assess time series forecast quality.<n>We present three experiments, including on both synthetic and real-world forecasting data.
arXiv Detail & Related papers (2025-12-12T21:59:53Z)
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via na"ive direct prompting.<n>ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context.<n>CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines.<n> IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare [1.9296797946506608]
Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results.<n> domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks.<n>Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task.
arXiv Detail & Related papers (2025-04-29T21:50:06Z)
Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting [14.579802892916101]
Large Language Models (LLMs) have recently demonstrated significant potential in time series forecasting.<n>However, their robustness and reliability in real-world applications remain under-explored.<n>We introduce a targeted adversarial attack framework for LLM-based time series forecasting.
arXiv Detail & Related papers (2024-12-11T04:53:15Z)
Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models. We validate this approach using four standard NLP benchmarks. We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [91.79567270986901]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses.<n>Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue.<n>We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z)
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism [39.392450788666814]
Current evaluations of large language models (LLMs) often overlook non-determinism. greedy decoding generally outperforms sampling methods for most evaluated tasks. Smaller LLMs can match or surpass larger models such as GPT-4-Turbo.
arXiv Detail & Related papers (2024-07-15T06:12:17Z)
Temporal Scaling Law for Large Language Models [57.83580734589091]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up.<n>In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position.<n>We derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [46.02234423159257]
Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years.<n>Recent works treat large language models as emphzero-shot time series reasoners without further fine-tuning.<n>Our study shows that LLMs perform well in predicting time series with clear patterns and trends, but face challenges with datasets lacking periodicity.
arXiv Detail & Related papers (2024-02-16T17:15:28Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs) Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Certified Robustness for Large Language Models with Self-Denoising [42.916661225753145]
We propose to denoise the corrupted inputs with large language models (LLMs) in a self-denoising manner. Our method outperforms the existing certification methods under both certified robustness and empirical robustness.
arXiv Detail & Related papers (2023-07-14T05:40:24Z)
On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.