FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
- URL: http://arxiv.org/abs/2602.03130v1
- Date: Tue, 03 Feb 2026 05:38:24 GMT
- Title: FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
- Authors: Chenxi Zhang, Ziliang Gan, Liyun Zhu, Youwei Pang, Qing Zhang, Rongjunchen Zhang,
- Abstract summary: FinMTM is a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks.
- Score: 15.654001393123403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
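The three task-specific protocols are easy to express in code. The sketch below is a hypothetical Python rendering under stated assumptions: the abstract names the protocols but not their exact formulas, so the Jaccard-style overlap, the 0.5/0.5 turn/session split, and the 0.3 planning weight are illustrative choices, not FinMTM's actual definitions.

```python
# Minimal sketch of FinMTM-style task-specific scoring. All weights and the
# exact formulas are assumptions for illustration; the paper only names the
# three protocols.

def mcq_set_overlap(predicted: set[str], gold: set[str]) -> float:
    """Set-overlap score for multiple-choice answers (assumed Jaccard overlap)."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    return len(predicted & gold) / len(predicted | gold)

def dialogue_score(turn_scores: list[float], session_score: float,
                   turn_weight: float = 0.5) -> float:
    """Weighted combination of the mean turn-level score and a session-level
    score; the paper states only that the two levels are weighted and combined."""
    mean_turn = sum(turn_scores) / len(turn_scores) if turn_scores else 0.0
    return turn_weight * mean_turn + (1.0 - turn_weight) * session_score

def agent_score(planning_quality: float, final_outcome: float,
                plan_weight: float = 0.3) -> float:
    """Composite agent metric integrating planning quality with the final
    outcome; a linear mix with an assumed 0.3 planning weight."""
    return plan_weight * planning_quality + (1.0 - plan_weight) * final_outcome

# Example with made-up values:
print(mcq_set_overlap({"A", "C"}, {"A", "B", "C"}))          # 2/3 ≈ 0.667
print(dialogue_score([0.8, 0.6, 0.9], session_score=0.7))    # ≈ 0.733
print(agent_score(planning_quality=0.5, final_outcome=0.9))  # 0.78
```

A partial-credit overlap rule like this rewards selecting some correct options without over-selecting, which is why set-based scoring is a natural fit for multiple-choice questions with several correct answers.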
Related papers
- The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems [54.12165004393043]
FinMMEval 2026 offers three interconnected tasks that span financial understanding, reasoning, and decision-making. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems.
arXiv Detail & Related papers (2026-02-11T14:14:06Z) - When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents [3.4992819560032267]
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.
arXiv Detail & Related papers (2026-02-11T00:04:56Z) - UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos [22.530796761115766]
We propose UniFinEval, the first unified multimodal benchmark for high-information-density financial environments. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems. Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts.
arXiv Detail & Related papers (2026-01-09T10:15:32Z) - FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism. A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z) - FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning [10.985136487771364]
FinMR is a knowledge-intensive multimodal dataset designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. It comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics. FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.
arXiv Detail & Related papers (2025-10-09T06:49:55Z) - FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging [12.897569424944107]
FinMMR is a novel bilingual benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts.
arXiv Detail & Related papers (2025-08-06T16:51:09Z) - MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation [80.08991479306681]
MEXA is a training-free framework that performs modality- and task-aware aggregation of expert models. We evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA.
arXiv Detail & Related papers (2025-06-20T16:14:13Z) - MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts. Its tasks include financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z) - CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model [21.702901343472558]
Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs). In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. We develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step.
arXiv Detail & Related papers (2025-06-16T02:52:44Z) - Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [88.96861155804935]
We introduce Open-FinLLMs, the first open-source multimodal financial LLMs. FinLLaMA is pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct is fine-tuned with 573K financial instructions; and FinLLaVA is enhanced with 1.43M multimodal tuning pairs. We evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings.
arXiv Detail & Related papers (2024-08-20T16:15:28Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA on the constructed dataset so that it can follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)