Related papers: Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports

URL: http://arxiv.org/abs/2508.05669v1
Date: Mon, 04 Aug 2025 04:54:00 GMT
Title: Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
Authors: Jin Khye Tan, En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah,
Abstract summary: We propose a fine-tuned vision-language model (VLM) based on Qwen2.5-VL-7B. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI's GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
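
For orientation, the tree-edit-distance-based similarity that the abstract builds on is commonly written as below. This is the standard TEDS form from the table-recognition literature, given here as a sketch: the abstract does not spell out how the authors' Markdown variant constructs its trees or weights edits, so the symbols $T_a$ (predicted table tree), $T_b$ (reference table tree), and $|T|$ (number of nodes in a tree) are assumptions about the general form, not the paper's exact definition.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Under this form a score of 1 means the two trees are identical, and the score falls toward 0 as more node insertions, deletions, or substitutions are needed to turn one tree into the other.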

Related papers

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents [3.4992819560032267]
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.
arXiv Detail & Related papers (2026-02-11T00:04:56Z)
Enhancing Business Analytics through Hybrid Summarization of Financial Reports [0.152292571922932]
Financial reports and earnings communications contain large volumes of structured and semi-structured information. We present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable summaries. These findings support the development of practical summarization systems for distilling lengthy financial texts into usable business insights.
arXiv Detail & Related papers (2025-12-28T16:25:12Z)
A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs [0.0]
This paper presents a framework for benchmarking Large Language Models (LLMs) on the task of classifying complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores.
arXiv Detail & Related papers (2025-09-08T15:48:17Z)
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information [18.75906880569719]
We introduce FinTagging, the first full-scope, table-aware benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs). Unlike prior benchmarks that oversimplify tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US- taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation.
arXiv Detail & Related papers (2025-05-27T02:55:53Z)
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
Financial Fraud Detection Using Explainable AI and Stacking Ensemble Methods [0.6642919568083927]
We propose a fraud detection framework that combines a stacking ensemble of gradient boosting models: XGBoost, LightGBM, and CatBoost. XAI techniques are used to enhance the transparency and interpretability of the model's decisions.
arXiv Detail & Related papers (2025-05-15T07:53:02Z)
Protecting multimodal large language models against misleading visualizations [94.71976205962527]
We show that question-answering (QA) accuracy on misleading visualizations drops on average to the level of the random baseline. We introduce the first inference-time methods to improve QA performance on misleading visualizations, without compromising accuracy on non-misleading ones. We find that two methods, table-based QA and redrawing the visualization, are effective, with improvements of up to 19.6 percentage points.
arXiv Detail & Related papers (2025-02-27T20:22:34Z)
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection, as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling [29.84946857859386]
We study the problem of automatically annotating relevant numerals occurring in financial documents with their corresponding tags. We propose a parameter-efficient solution for the task using LoRA. Our proposed model, FLAN-FinXC, achieves new state-of-the-art performance on both datasets.
arXiv Detail & Related papers (2024-05-03T16:41:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.