FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
- URL: http://arxiv.org/abs/2602.03130v1
- Date: Tue, 03 Feb 2026 05:38:24 GMT
- Title: FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
- Authors: Chenxi Zhang, Ziliang Gan, Liyun Zhu, Youwei Pang, Qing Zhang, Rongjunchen Zhang,
- Abstract summary: FinMTM is a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks.
- Score: 15.654001393123403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
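The three task-specific protocols are easy to express in code. The sketch below is a hypothetical Python rendering under stated assumptions: the abstract names the protocols but not their exact formulas, so the Jaccard-style overlap, the 0.5/0.5 turn/session split, and the 0.3 planning weight are illustrative choices, not FinMTM's actual definitions.

```python
# Minimal sketch of FinMTM-style task-specific scoring. All weights and the
# exact formulas are assumptions for illustration; the paper only names the
# three protocols.

def mcq_set_overlap(predicted: set[str], gold: set[str]) -> float:
    """Set-overlap score for multiple-choice answers (assumed Jaccard overlap)."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    return len(predicted & gold) / len(predicted | gold)

def dialogue_score(turn_scores: list[float], session_score: float,
                   turn_weight: float = 0.5) -> float:
    """Weighted combination of the mean turn-level score and a session-level
    score; the paper states only that the two levels are weighted and combined."""
    mean_turn = sum(turn_scores) / len(turn_scores) if turn_scores else 0.0
    return turn_weight * mean_turn + (1.0 - turn_weight) * session_score

def agent_score(planning_quality: float, final_outcome: float,
                plan_weight: float = 0.3) -> float:
    """Composite agent metric integrating planning quality with the final
    outcome; a linear mix with an assumed 0.3 planning weight."""
    return plan_weight * planning_quality + (1.0 - plan_weight) * final_outcome

# Example with made-up values:
print(mcq_set_overlap({"A", "C"}, {"A", "B", "C"}))          # 2/3 ≈ 0.667
print(dialogue_score([0.8, 0.6, 0.9], session_score=0.7))    # ≈ 0.733
print(agent_score(planning_quality=0.5, final_outcome=0.9))  # 0.78
```

A partial-credit overlap rule like this rewards selecting some correct options without over-selecting, which is why set-based scoring is a natural fit for multiple-choice questions with several correct answers.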
Related papers
- The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems [54.12165004393043]
FinMMEval 2026 offers three interconnected tasks that span financial understanding, reasoning, and decision-making. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems.
arXiv Detail & Related papers (2026-02-11T14:14:06Z) - When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents [3.4992819560032267]
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.
arXiv Detail & Related papers (2026-02-11T00:04:56Z) - UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos [22.530796761115766]
We propose UniFinEval, the first unified multimodal benchmark for high-information-density financial environments. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems. Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts.
arXiv Detail & Related papers (2026-01-09T10:15:32Z) - FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism. A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z) - FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning [10.985136487771364]
FinMR is a knowledge-intensive multimodal dataset designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. It comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics. FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.
arXiv Detail & Related papers (2025-10-09T06:49:55Z) - FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging [12.897569424944107]
FinMMR is a novel bilingual benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts.
arXiv Detail & Related papers (2025-08-06T16:51:09Z) - MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation [80.08991479306681]
MEXA is a training-free framework that performs modality- and task-aware aggregation of expert models. We evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA.
arXiv Detail & Related papers (2025-06-20T16:14:13Z) - MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts. Its tasks include financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z) - CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model [21.702901343472558]
Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs). In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. We develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step.
arXiv Detail & Related papers (2025-06-16T02:52:44Z) - Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [88.96861155804935]
We introduce Open-FinLLMs, the first open-source multimodal financial LLMs. FinLLaMA is pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct is fine-tuned with 573K financial instructions; and FinLLaVA is enhanced with 1.43M multimodal tuning pairs. We evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings.
arXiv Detail & Related papers (2024-08-20T16:15:28Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA on the constructed dataset so that it can follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)