Related papers: FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing


URL: http://arxiv.org/abs/2603.02702v1
Date: Tue, 03 Mar 2026 07:45:57 GMT
Title: FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
Authors: Jaehoon Lee, Suhwan Park, Tae Yoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, SoonYoung Lee, Yongjae Lee, Wonbin Ahn
Abstract summary: We propose a semantic-based and multi-level pairing framework to pair text with financial time-series data. We show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
Score: 33.23601503890859
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to the publicly available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
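
Illustrative sketch (not the authors' implementation): the embedding-based matching step described in the abstract can be pictured as ranking news articles by their similarity to a company-context text. The abstract does not name an embedding model, a similarity measure, or a cutoff, so everything below is an assumption: `embed` is a toy bag-of-words stand-in for a real text-embedding model, cosine similarity is one common choice, and `retrieve`, `top_k`, and the sample texts are hypothetical. The LLM-based four-level classification is not shown.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a text-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context, articles, top_k=2):
    """Return the top_k (score, article) pairs most similar to the context."""
    ctx = embed(context)
    scored = [(cosine(ctx, embed(article)), article) for article in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical company context (in the paper this comes from SEC filings).
company_context = "designs graphics processors and data center chips for AI workloads"

# Hypothetical news headlines.
articles = [
    "Chip export rules tighten for AI data center processors",
    "Coffee prices climb after poor harvest in Brazil",
    "Cloud providers expand orders for graphics processors",
]

ranked = retrieve(company_context, articles)
for score, article in ranked:
    print(f"{score:.3f}  {article}")
```

With a real embedding model in place of `embed`, matching would depend on meaning rather than shared words, which is the advantage over keyword matching that the abstract emphasizes.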

Related papers

Enhancing Business Analytics through Hybrid Summarization of Financial Reports [0.152292571922932]
Financial reports and earnings communications contain large volumes of structured and semi-structured information. We present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable summaries. These findings support the development of practical summarization systems for distilling lengthy financial texts into usable business insights.
arXiv Detail & Related papers (2025-12-28T16:25:12Z)
FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism. A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z)
When Does Multimodality Lead to Better Time Series Forecasting? [96.26052272121615]
We investigate whether and under what conditions such multimodal integration consistently yields gains. Our findings reveal that the benefits of multimodality are highly condition-dependent. Our study offers a rigorous, quantitative foundation for understanding when multimodality can be expected to aid forecasting tasks.
arXiv Detail & Related papers (2025-06-20T23:55:56Z)
MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering [21.064096256892686]
Multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering. We introduce Multimodal Time Series Benchmark (MTBench), a benchmark to evaluate large language models (LLMs) on time series and text understanding. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns.
arXiv Detail & Related papers (2025-03-21T05:04:53Z)
Towards Temporal-Aware Multi-Modal Retrieval Augmented Generation in Finance [79.78247299859656]
FinTMMBench is the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation systems in finance. Built from heterologous data of NASDAQ 100 companies, FinTMMBench offers three significant advantages.
arXiv Detail & Related papers (2025-03-07T07:13:59Z)
Quantifying Qualitative Insights: Leveraging LLMs to Market Predict [0.0]
This study addresses challenges by leveraging daily reports from securities firms to create high-quality contextual information. The reports are segmented into text-based key factors and combined with numerical data, such as price information, to form context sets. A crafted prompt is designed to assign scores to the key factors, converting qualitative insights into quantitative results.
arXiv Detail & Related papers (2024-11-13T07:45:40Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters. We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
Context Matters: Leveraging Contextual Features for Time Series Forecasting [2.9687381456164004]
We introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information. It outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.
arXiv Detail & Related papers (2024-10-16T15:36:13Z)
Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach [0.0]
This paper presents a novel approach to financial news processing that leverages Large Language Models (LLMs). We introduce a system that extracts relevant company tickers from raw news article content, performs sentiment analysis at the company level, and generates summaries. We are the first data provider to offer granular, per-company sentiment analysis from news articles, enhancing the depth of information available to market participants.
arXiv Detail & Related papers (2024-07-22T16:47:31Z)
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z)
Financial data analysis application via multi-strategy text processing [0.2741266294612776]
This paper mainly focuses on the stock trading data and news about China A-share companies. We present our efforts and plans in deep learning financial text processing application scenarios using natural language processing (NLP) and knowledge graph (KG) technologies.
arXiv Detail & Related papers (2022-04-25T01:56:36Z)
Gaussian process imputation of multiple financial series [71.08576457371433]
Multiple time series such as financial indicators, stock prices and exchange rates are strongly coupled due to their dependence on the latent state of the market. We focus on learning the relationships among financial time series by modelling them through a multi-output Gaussian process.
arXiv Detail & Related papers (2020-02-11T19:18:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.