When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents
- URL: http://arxiv.org/abs/2602.10384v2
- Date: Thu, 12 Feb 2026 20:41:46 GMT
- Authors: Virginie Mouilleron, Théo Lasnier, Djamé Seddah
- Abstract summary: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.
- Score: 3.4992819560032267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
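The LLM-as-judge protocol mentioned in the abstract can be sketched generically: a judge model is shown the question, the reference answer, and the candidate answer, and emits a binary verdict, from which benchmark accuracy is computed. A minimal sketch, assuming a pluggable judge callable; the prompt wording, the `toy_llm` stand-in, and the example items are illustrative assumptions, not the paper's actual protocol:

```python
from typing import Callable

# Illustrative judge prompt (not the paper's exact wording).
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {gold}\n"
    "Candidate answer: {prediction}\n"
    "Answer CORRECT if the candidate conveys the same information as the "
    "reference, otherwise INCORRECT."
)

def toy_llm(prompt: str) -> str:
    """Stand-in for a real judge model: parses the two answers back out of
    the prompt and does a normalized string comparison."""
    fields = {line.split(": ", 1)[0]: line.split(": ", 1)[1]
              for line in prompt.splitlines() if ": " in line}
    same = (fields["Reference answer"].strip().lower()
            == fields["Candidate answer"].strip().lower())
    return "CORRECT" if same else "INCORRECT"

def judge_accuracy(items, llm: Callable[[str], str]) -> float:
    """Score (question, gold, prediction) triples with the judge model."""
    verdicts = []
    for question, gold, prediction in items:
        reply = llm(JUDGE_PROMPT.format(
            question=question, gold=gold, prediction=prediction))
        verdicts.append(reply.strip().upper().startswith("CORRECT"))
    return sum(verdicts) / len(verdicts)

# Hypothetical items in the spirit of the benchmark's extraction questions.
items = [
    ("Total net assets?", "12.4 million EUR", "12.4 million eur"),
    ("Management fee?", "1.5%", "2.0%"),
]
print(judge_accuracy(items, toy_llm))  # 0.5
```

In practice `toy_llm` would be replaced by a call to an actual judge model; the binary-verdict parsing and the accuracy aggregation stay the same.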
Related papers
- The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems [54.12165004393043]
FinMMEval 2026 offers three interconnected tasks that span financial understanding, reasoning, and decision-making. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems.
arXiv Detail & Related papers (2026-02-11T14:14:06Z)
- FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation [27.697631967262662]
FinMMDocR is a novel benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements.
arXiv Detail & Related papers (2025-12-31T15:00:03Z)
- XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning [28.967959142733903]
We introduce XFinBench, a novel benchmark to evaluate large language models' ability to solve financial problems. O1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags behind human experts by 12.5%. We construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge brings consistent accuracy improvements only to small open-source models.
arXiv Detail & Related papers (2025-08-20T15:23:35Z)
- MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts. Financial reasoning tests cross-lingual evidence integration from filings and news, and financial OCR extracts structured text from scanned documents. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, performing stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z)
- CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model [21.702901343472558]
Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs). In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. We develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step.
arXiv Detail & Related papers (2025-06-16T02:52:44Z)
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [88.96861155804935]
We introduce Open-FinLLMs, the first open-source multimodal financial LLMs. FinLLaMA is pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct is fine-tuned with 573K financial instructions; and FinLLaVA is enhanced with 1.43M multimodal tuning pairs. We evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
- Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance [1.863067234952186]
We investigate the model-agnostic meta-learning (MAML) algorithm on low-resource financial NLU tasks.
Our models achieve state-of-the-art performance in our experiments.
arXiv Detail & Related papers (2023-03-06T02:24:48Z)
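The MAML entry above adapts a shared initialization to each task with a few gradient steps, then updates that initialization so it adapts well everywhere. A minimal first-order sketch on toy 1-D regression tasks (y = a·x); the task setup, learning rates, and `fomaml` helper are illustrative assumptions, not the paper's financial-NLU configuration:

```python
import random

def grad(w, xs, ys):
    # d/dw of mean squared error for the scalar model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def fomaml(tasks, meta_w=0.0, inner_lr=0.05, meta_lr=0.1, steps=200):
    """First-order MAML: inner-loop adaptation per task, then a meta-update
    using the gradient at the adapted parameters."""
    for _ in range(steps):
        meta_grad = 0.0
        for a in tasks:  # each task is a regression slope a
            # support set: adapt the shared initialization to this task
            xs = [random.uniform(-1, 1) for _ in range(8)]
            ys = [a * x for x in xs]
            w = meta_w - inner_lr * grad(meta_w, xs, ys)
            # query set: outer gradient evaluated at the adapted weights
            qx = [random.uniform(-1, 1) for _ in range(8)]
            qy = [a * x for x in qx]
            meta_grad += grad(w, qx, qy)  # first-order approximation
        meta_w -= meta_lr * meta_grad / len(tasks)
    return meta_w

random.seed(0)
w0 = fomaml(tasks=[0.5, 1.5])
# The learned initialization should land between the task optima 0.5 and 1.5.
print(round(w0, 2))
```

Full MAML would differentiate through the inner update (a second-order term); the first-order variant shown here drops that term, which is the common cheap approximation.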
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.