VISTA: Vision-Language Inference for Training-Free Stock Time-Series Analysis
- URL: http://arxiv.org/abs/2505.18570v3
- Date: Wed, 11 Jun 2025 18:38:02 GMT
- Title: VISTA: Vision-Language Inference for Training-Free Stock Time-Series Analysis
- Authors: Tina Khezresmaeilzadeh, Parsa Razmara, Seyedarmin Azizi, Mohammad Erfan Sadeghi, Erfan Baghaei Potraghloo,
- Abstract summary: We introduce VISTA (Vision-Language Inference for Stock Time-series Analysis), a training-free framework for multi-modal stock forecasting. We benchmark VISTA against standard baselines, including ARIMA and text-only LLM-based prompting methods. We show that VISTA outperforms these baselines by up to 89.83%, demonstrating the effectiveness of multi-modal inference for stock time-series analysis.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stock price prediction remains a complex and high-stakes task in financial analysis, traditionally addressed using statistical models or, more recently, language models. In this work, we introduce VISTA (Vision-Language Inference for Stock Time-series Analysis), a novel, training-free framework that leverages Vision-Language Models (VLMs) for multi-modal stock forecasting. VISTA prompts a VLM with both textual representations of historical stock prices and their corresponding line charts to predict future price values. By combining numerical and visual modalities in a zero-shot setting and using carefully designed chain-of-thought prompts, VISTA captures complementary patterns that unimodal approaches often miss. We benchmark VISTA against standard baselines, including ARIMA and text-only LLM-based prompting methods. Experimental results show that VISTA outperforms these baselines by up to 89.83%, demonstrating the effectiveness of multi-modal inference for stock time-series analysis and highlighting the potential of VLMs in financial forecasting tasks without requiring task-specific training.
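The abstract describes a concrete pipeline: serialize the price history as text, render the same series as a line chart, and ask a VLM to reason step by step before predicting. Below is a minimal sketch of that pipeline, assuming the OpenAI chat API as a stand-in VLM backend; the prompt wording, model choice, and output format are illustrative assumptions, not the paper's exact configuration.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def render_chart(prices: list[float]) -> str:
    """Render the price history as a line chart and return it base64-encoded."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(range(len(prices)), prices)
    ax.set_xlabel("day")
    ax.set_ylabel("close price")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode()


def vista_forecast(prices: list[float], horizon: int = 1) -> str:
    """Zero-shot multi-modal forecast: numeric text + chart image + CoT prompt."""
    prompt = (
        f"Here are the last {len(prices)} daily closing prices of a stock:\n"
        f"{', '.join(f'{p:.2f}' for p in prices)}\n"
        "The attached image shows the same series as a line chart.\n"
        "Reason step by step about the trend, volatility, and any visual "
        f"patterns, then predict the next {horizon} closing price(s). "
        "End with a line of the form 'PREDICTION: <values>'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder VLM; the abstract does not pin a model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{render_chart(prices)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

For the ARIMA baseline mentioned above, statsmodels' `ARIMA(prices, order=(1, 1, 1)).fit().forecast(steps=horizon)` produces a comparable point forecast over the same window; the `(1, 1, 1)` order is an assumption, not the paper's configuration.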
Related papers
- Reasoning on Time-Series for Financial Technical Analysis [45.81831399666851]
We introduce Verbal Technical Analysis (VTA), a novel framework that combines verbal and latent reasoning to produce stock time-series forecasts. Experiments on stock datasets across U.S., Chinese, and European markets show that VTA achieves state-of-the-art forecasting accuracy.
arXiv Detail & Related papers (2025-11-06T15:21:57Z)
- Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned [29.44294456857936]
Process Reward Models (PRMs) improve the reliability of reasoning in large language models. Existing Vision-Language PRMs rely on Monte Carlo Tree Search (MCTS) for data construction. We introduce a hybrid data framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels.
arXiv Detail & Related papers (2025-09-27T10:56:58Z)
- Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series [18.185361179633553]
Text and time series data offer complementary views of financial markets. We propose a unified neural architecture that models these interleaved sequences using modality-specific experts. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task.
arXiv Detail & Related papers (2025-09-23T22:40:31Z)
- Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving [57.22004912994658]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
- Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy [1.481550828146527]
Annotators' Instruction Assisted Prompt (AIAP) aims to standardize the understanding of sentiment across both human and machine interpretations. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit, to demonstrate how AIAP significantly enhances LLM performance. This context-aware approach yields incremental gains in performance and also introduces an innovative sentiment-indexing method.
arXiv Detail & Related papers (2025-05-09T19:44:04Z)
- BreakGPT: Leveraging Large Language Models for Predicting Asset Price Surges [55.2480439325792]
This paper introduces BreakGPT, a novel large language model (LLM) architecture adapted specifically for time series forecasting and the prediction of sharp upward movements in asset prices.
We showcase BreakGPT as a promising solution for financial forecasting with minimal training and as a strong competitor for capturing both local and global temporal dependencies.
arXiv Detail & Related papers (2024-11-09T05:40:32Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- StockTime: A Time Series Specialized Large Language Model Architecture for Stock Price Prediction [13.52020491768311]
We introduce StockTime, a novel LLM-based architecture designed specifically for stock price time series data, unlike recent FinLLMs.
By fusing this multimodal data, StockTime effectively predicts stock prices across arbitrary look-back periods.
arXiv Detail & Related papers (2024-08-25T00:50:33Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary (a minimal sketch of this mapping appears after this list).
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis [15.20897845057384]
FinVis-GPT is a novel multimodal large language model (LLM) specifically designed for financial chart analysis.
The proposed FinVis-GPT serves as a pioneering effort in utilizing multimodal LLMs in the finance domain.
arXiv Detail & Related papers (2023-07-31T07:44:15Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
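As a reading aid for the visual-words entry above: the mapping it describes amounts to projecting each visual feature onto the LMM's input embedding matrix and normalizing with a softmax, so that every image patch becomes a distribution over the text vocabulary. A minimal sketch, with the shapes, temperature, and random stand-in tensors assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F


def visual_words(visual_feats: torch.Tensor,
                 vocab_embeddings: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Map visual features to probability distributions over the text vocabulary.

    visual_feats:     (num_patches, d_model) patch features, already projected
                      into the LMM's embedding space (an assumption here).
    vocab_embeddings: (vocab_size, d_model) the LMM's input embedding matrix.
    Returns:          (num_patches, vocab_size), one distribution per patch.
    """
    logits = visual_feats @ vocab_embeddings.T / temperature
    return F.softmax(logits, dim=-1)


# Toy usage with random tensors standing in for a real encoder and vocabulary.
patches = torch.randn(16, 512)
vocab = torch.randn(32000, 512)
dist = visual_words(patches, vocab)  # each row sums to 1
```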
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.