FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
- URL: http://arxiv.org/abs/2505.20650v1
- Date: Tue, 27 May 2025 02:55:53 GMT
- Title: FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
- Authors: Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie
- Abstract summary: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs). Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation.
- Score: 18.75906880569719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.
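The two-subtask decomposition described in the abstract can be pictured with a toy sketch. This is not the authors' method or evaluation code: the regex extractor, the token-overlap scorer, the three-entry `TOY_TAXONOMY` with its paraphrased descriptions, and the helper names (`extract_facts`, `align_concept`, `tag`) are all assumptions made for illustration. Stage 1 stands in for FinNI-style fact extraction and stage 2 for FinCL-style concept alignment.

```python
import re
from dataclasses import dataclass

# Hypothetical three-entry slice; the benchmark aligns against the full 10k+ US-GAAP taxonomy.
# The descriptions are illustrative paraphrases, not official taxonomy documentation.
TOY_TAXONOMY = {
    "Revenues": "total revenues recognized during the period",
    "NetIncomeLoss": "net income or loss for the period",
    "CashAndCashEquivalentsAtCarryingValue": "cash and cash equivalents carrying value",
}


@dataclass
class Fact:
    value: str    # numeric mention as it appears in the text
    context: str  # sentence the mention was found in


def extract_facts(sentence):
    """Stage 1 (FinNI-like): collect numeric mentions together with their sentence."""
    pattern = r"\d+(?:,\d{3})*(?:\.\d+)?"
    return [Fact(m.group(), sentence) for m in re.finditer(pattern, sentence)]


def tokens(text):
    """Lowercased alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def align_concept(fact, taxonomy=TOY_TAXONOMY):
    """Stage 2 (FinCL-like): pick the concept whose description shares the most tokens."""
    ctx = tokens(fact.context)
    return max(taxonomy, key=lambda name: len(ctx & tokens(taxonomy[name])))


def tag(sentence):
    """Run both stages and return (value, concept) pairs."""
    return [(f.value, align_concept(f)) for f in extract_facts(sentence)]


print(tag("Net income for the period was 1,250 million."))
```

Token overlap is far too crude to separate closely related taxonomy entries, which is the failure mode the abstract reports for LLMs on concept alignment; the sketch only shows how extraction and alignment can be scored as separate stages.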
Related papers
- Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents [0.21485350418225244]
Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems.
Structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs.
This paper proposes a framework to incorporate structured information using KGs along with LLM predictions for numerical reasoning tasks.
arXiv Detail & Related papers (2026-01-12T17:39:08Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.
Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations.
Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs [40.216867348210265]
FinAuditing is the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating financial auditing tasks.
Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency.
Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions.
arXiv Detail & Related papers (2025-10-10T00:41:55Z)
- FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance.
The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms.
We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z)
- FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance [0.06597195879147556]
Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance.
We develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs.
Our work serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
arXiv Detail & Related papers (2025-08-07T09:37:14Z)
- Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports [0.0]
We propose a fine-tuned vision-language model (VLM) based on Qwen2.5-VL-7B.
Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA.
Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% markdown TEDS score.
arXiv Detail & Related papers (2025-08-04T04:54:00Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction Benchmark (AOE) is designed to evaluate the ability of large language models to comprehend fragmented documents.
AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries.
Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts.
Financial reasoning tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents.
Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z)
- Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking [3.94375691568608]
Limit Order Book (LOB) provides a fine-grained view of market dynamics.
Existing approaches often tightly couple representation learning with specific downstream tasks in an end-to-end manner.
We introduce LOBench, a standardized benchmark with real China A-share market data, offering curated datasets, unified preprocessing, consistent evaluation metrics, and strong baselines.
arXiv Detail & Related papers (2025-05-04T15:00:00Z)
- HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings [42.63642722062992]
We introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset.
Our approach organizes a 218,126-label hierarchy using a taxonomy-based grouping method.
We additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels.
arXiv Detail & Related papers (2025-02-21T12:19:08Z)
- Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system.
It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
- KG-CF: Knowledge Graph Completion with Context Filtering under the Guidance of Large Language Models [55.39134076436266]
KG-CF is a framework tailored for ranking-based knowledge graph completion tasks.
KG-CF leverages LLMs' reasoning abilities to filter out irrelevant contexts, achieving superior results on real-world datasets.
arXiv Detail & Related papers (2025-01-06T01:52:15Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- NIFTY Financial News Headlines Dataset [14.622656548420073]
The NIFTY Financial News Headlines dataset is designed to facilitate and advance research in financial market forecasting using large language models (LLMs).
This dataset comprises two distinct versions tailored for different modeling approaches: (i) NIFTY-LM, which targets supervised fine-tuning (SFT) of LLMs with an auto-regressive, causal language-modeling objective, and (ii) NIFTY-RL, formatted specifically for alignment methods (like reinforcement learning from human feedback) to align LLMs via rejection sampling and reward modeling.
arXiv Detail & Related papers (2024-05-16T01:09:33Z)
- Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling [29.84946857859386]
We study the problem of automatically annotating relevant numerals occurring in the financial documents with their corresponding tags.
We propose a parameter efficient solution for the task using LoRA.
Our proposed model, FLAN-FinXC, achieves new state-of-the-art performances on both the datasets.
arXiv Detail & Related papers (2024-05-03T16:41:36Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Data-Centric Financial Large Language Models [27.464319154543173]
Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance.
We propose a data-centric approach to enable LLMs to better handle financial tasks.
arXiv Detail & Related papers (2023-10-07T04:53:31Z)
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)