Language Models Still Struggle to Zero-shot Reason about Time Series
- URL: http://arxiv.org/abs/2404.11757v1
- Date: Wed, 17 Apr 2024 21:27:33 GMT
- Title: Language Models Still Struggle to Zero-shot Reason about Time Series
- Authors: Mike A. Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, Tim Althoff,
- Abstract summary: Time series are critical for decision-making in fields like finance and healthcare.
It remains unknown whether non-trivial forecasting implies that language models can reason about time series.
We generate a first-of-its-kind evaluation framework for time series reasoning.
- Score: 11.764833497297493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time series are critical for decision-making in fields like finance and healthcare. Their importance has driven a recent influx of works passing time series into language models, leading to non-trivial forecasting on some datasets. But it remains unknown whether non-trivial forecasting implies that language models can reason about time series. To address this gap, we generate a first-of-its-kind evaluation framework for time series reasoning, including formal tasks and a corresponding dataset of multi-scale time series paired with text captions across ten domains. Using these data, we probe whether language models achieve three forms of reasoning: (1) Etiological Reasoning - given an input time series, can the language model identify the scenario that most likely created it? (2) Question Answering - can a language model answer factual questions about time series? (3) Context-Aided Forecasting - does highly relevant textual context improve a language model's time series forecasts? We find that otherwise highly-capable language models demonstrate surprisingly limited time series reasoning: they score marginally above random on etiological and question answering tasks (up to 30 percentage points worse than humans) and show modest success in using context to improve forecasting. These weakness showcase that time series reasoning is an impactful, yet deeply underdeveloped direction for language model research. We also make our datasets and code public at to support further research in this direction at https://github.com/behavioral-data/TSandLanguage
Related papers
- Large language models can be zero-shot anomaly detectors for time series? [9.249657468385779]
sigllm is a framework for time series anomaly detection using large language models.
We present a prompt-based detection method that directly asks a language model to indicate which elements of the input are anomalies.
We show that the forecasting method significantly outperformed the prompting method in all 11 datasets with respect to the F1 score.
arXiv Detail & Related papers (2024-05-23T16:21:57Z) - A Survey of Time Series Foundation Models: Generalizing Time Series Representation with Large Language Model [33.17908422599714]
Large language foundation models have unveiled their capabilities for cross-task transferability, zero-shot/few-shot learning, and decision-making explainability.
There are two main research lines, namely pre-training foundation models from scratch for time series and adapting large language foundation models for time series.
This survey offers a 3E analytical framework for comprehensive examination of related research.
arXiv Detail & Related papers (2024-05-03T03:12:55Z) - Large Language Models Are Zero-Shot Time Series Forecasters [48.73953666153385]
By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text.
We find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks.
arXiv Detail & Related papers (2023-10-11T19:01:28Z) - Pushing the Limits of Pre-training for Time Series Forecasting in the
CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show it is a strong zero-shot baseline and benefits from further scaling, both in model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z) - Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems.
We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting.
Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z) - Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating
Generalization Capacity of Language Models [18.874880342410876]
We present Jamp, a Japanese benchmark focused on temporal inference.
Our dataset includes a range of temporal inference patterns, which enables us to conduct fine-grained analysis.
We evaluate the generalization capacities of monolingual/multilingual LMs by splitting our dataset based on tense fragments.
arXiv Detail & Related papers (2023-06-19T07:00:14Z) - Towards Benchmarking and Improving the Temporal Reasoning Capability of
Large Language Models [44.670550143705746]
We introduce a comprehensive probing dataset tempreason to evaluate the temporal reasoning capability of large language models.
Our dataset includes questions of three temporal reasoning levels.
We also propose a novel learning framework to improve the temporal reasoning capability of large language models.
arXiv Detail & Related papers (2023-06-15T08:44:41Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - A Dataset for Answering Time-Sensitive Questions [88.95075983560331]
Time is an important dimension in our physical world. Lots of facts can evolve with respect to time.
It is important to consider the time dimension and empower the existing QA models to reason over time.
The existing QA datasets contain rather few time-sensitive questions, hence not suitable for diagnosing or benchmarking the model's temporal reasoning capability.
arXiv Detail & Related papers (2021-08-13T16:42:25Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - TIMEDIAL: Temporal Commonsense Reasoning in Dialog [43.24596551545824]
We present the first study to investigate pre-trained language models for their temporal reasoning capabilities in dialogs.
We formulate TIME-DIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs.
Empirical results demonstrate that even the best performing models struggle on this task compared to humans.
arXiv Detail & Related papers (2021-06-08T17:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.