BEDTime: A Unified Benchmark for Automatically Describing Time Series
- URL: http://arxiv.org/abs/2509.05215v2
- Date: Tue, 30 Sep 2025 03:31:52 GMT
- Title: BEDTime: A Unified Benchmark for Automatically Describing Time Series
- Authors: Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen,
- Abstract summary: We argue that successful multi-modal models should be able to recognize, differentiate, and generate language descriptions of time series.<n>We then create BEDTime, the first benchmark dataset to assess models on each task.<n>Using BEDTime, we evaluate 13 state-of-the-art models, and find that dedicated time series foundation models severely underperform.
- Score: 8.466823017204641
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question-answering. However, they skip evaluations of simple and important foundational tasks, which complex models should reliably master. They also lack direct, head-to-head comparisons with other popular approaches. So we ask a simple question: Can recent models even produce generic visual descriptions of time series data? In response, we propose three new tasks, posing that successful multi-modal models should be able to recognize, differentiate, and generate language descriptions of time series. We then create BEDTime, the first benchmark dataset to assess models on each task, comprising four datasets reformatted for these tasks across multiple modalities. Using BEDTime, we evaluate 13 state-of-the-art models, and find that (1) surprisingly, dedicated time series foundation models severely underperform, despite being designed for similar tasks, (2) vision-language models are quite capable, (3) language-only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of realistic robustness tests, indicating avenues for future work.
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs.<n>We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - Empowering Time Series Analysis with Large-Scale Multimodal Pretraining [34.22079919081765]
Building multimodal foundation models is a natural next step, but it faces key challenges.<n>Lack of a unified multimodal pretraining paradigm and large-scale multimodal corpora for time series analysis.<n>How to effectively integrate heterogeneous modalities and enhance model generalization.
arXiv Detail & Related papers (2026-02-05T13:26:35Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [79.58755811919366]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.<n>It is designed to accurately detect horizontal or oriented objects from any sensor modality.<n>This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - How to Merge Your Multimodal Models Over Time? [73.11304741033761]
We propose a unified framework called TIME which defines temporal model merging across three axes.<n>We study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark.
arXiv Detail & Related papers (2024-12-09T18:01:13Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Deep Time Series Models: A Comprehensive Survey and Benchmark [74.28364194333447]
Time series data is of great significance in real-world scenarios.
Recent years have witnessed remarkable breakthroughs in the time series community.
We release Time Series Library (TSLib) as a fair benchmark of deep time series models for diverse analysis tasks.
arXiv Detail & Related papers (2024-07-18T08:31:55Z) - Advancing Time Series Classification with Multimodal Language Modeling [6.624754582682479]
We propose InstructTime, a novel attempt to reshape time series classification as a learning-to-generate paradigm.
The core idea is to formulate the classification of time series as a multimodal understanding task, in which both task-specific instructions and raw time series are treated as multimodal inputs.
Extensive experiments are conducted over benchmark datasets, whose results uncover the superior performance of InstructTime.
arXiv Detail & Related papers (2024-03-19T02:32:24Z) - UniTS: A Unified Multi-Task Time Series Model [31.675845788410246]
UniTS is a unified multi-task time series model that integrates predictive and generative tasks into a single framework.
UniTS is tested on 38 datasets across human activity sensors, healthcare, engineering, and finance.
arXiv Detail & Related papers (2024-02-29T21:25:58Z) - TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis [29.232543319667005]
This work studies the problem of time series analysis with generalist (or foundation) models.<n>We consider the simple strategy of discretely tokenizing time series data drawn from a myriad of datasets via self-supervision.<n>Our method, TOkenized Time Series EMbeddings (TOTEM), produces such generalist time series models with minimal or no fine-tuning.
arXiv Detail & Related papers (2024-02-26T09:11:12Z) - MOMENT: A Family of Open Time-series Foundation Models [19.0845213853369]
We introduce MOMENT, a family of open-source foundation models for general-purpose time series analysis.
We compile a collection of public time series, called the Time series Pile, and systematically tackle time series-specific challenges.
We build on recent work to design a benchmark to evaluate time series foundation models on diverse tasks and datasets in limited supervision settings.
arXiv Detail & Related papers (2024-02-06T10:48:46Z) - Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems.
We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting.
Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.