FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning
- URL: http://arxiv.org/abs/2412.12567v2
- Date: Wed, 19 Feb 2025 08:59:40 GMT
- Title: FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning
- Authors: Seunghee Kim, Changhyeon Kim, Taeuk Kim,
- Abstract summary: We present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark to analyze the reasoning capabilities of MLLMs.
FCMR is categorized into three difficulty levels-Easy, Medium, and Hard-facilitating a step-by-step evaluation.
Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model achieving only 30.4% accuracy on the most challenging tier.
- Score: 5.65203350495478
- License:
- Abstract: Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels-Easy, Medium, and Hard-facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.
Related papers
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs)
We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs.
We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z) - PAL: Prompting Analytic Learning with Missing Modality for Multi-Modal Class-Incremental Learning [42.00851701431368]
Multi-modal class-incremental learning (MMCIL) seeks to leverage multi-modal data, such as audio-visual and image-text pairs.
A critical challenge remains: the issue of missing modalities during incremental learning phases.
We propose PAL, a novel exemplar-free framework tailored to MMCIL under missing-modality scenarios.
arXiv Detail & Related papers (2025-01-16T08:04:04Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias inVidQA benchmarks and datasets.
We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.
Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z) - CatMemo at the FinLLM Challenge Task: Fine-Tuning Large Language Models using Data Fusion in Financial Applications [10.225210627594894]
This paper presents our solution to IJCAI-2024 FinLLM challenge, investigating the capabilities of LLMs within three critical areas of financial tasks.
Financial classification, financial text summarization, and single stock trading are investigated.
Our approach aims to tackle these diverse tasks in a comprehensive and integrated manner, showcasing LLMs' capacity to address diverse and complex financial tasks with improved accuracy and decision-making capabilities.
arXiv Detail & Related papers (2024-07-02T05:04:13Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - MMRel: A Relation Understanding Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a benchmark that features large-scale, high-quality, and diverse data on inter-object relations.
MMRel is ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs.
Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts.
Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - Stock Movement Prediction with Multimodal Stable Fusion via Gated Cross-Attention Mechanism [11.6870352823637]
This study introduces a novel architecture, named Multimodal Stable Fusion with Gated Cross-Attention (MSGCA), designed to robustly integrate multimodal input for stock movement prediction.
MSGCA framework consists of three integral components: (1) a trimodal encoding module, responsible for processing indicator sequences, dynamic documents, and a relational graph, and standardizing their feature representations; (2) a cross-feature fusion module, where primary and consistent features guide the multimodal fusion of the three modalities via a pair of gated cross-attention networks; and (3) a prediction module, which refines the fused features through temporal and dimensional reduction to execute precise
arXiv Detail & Related papers (2024-06-06T03:13:34Z) - M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [50.576016777061724]
Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning.
The current MCoT benchmark still faces some challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) Domain missing.
We introduce a novel benchmark (M$3$CoT) to address the above challenges, advancing the multi-domain, multi-step, and multi-modal CoT.
arXiv Detail & Related papers (2024-05-26T07:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.