FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
- URL: http://arxiv.org/abs/2410.04526v2
- Date: Tue, 8 Oct 2024 05:06:05 GMT
- Title: FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
- Authors: Siqiao Xue, Tingting Chen, Fan Zhou, Qingyang Dai, Zhixuan Chu, Hongyuan Mei
- Abstract summary: FAMMA is an open-source benchmark for financial multilingual multimodal question answering.
It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams.
We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models.
- Score: 22.245216871611678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of multimodal large language models (MLLMs) in answering questions that require advanced financial knowledge and sophisticated reasoning. It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams, spanning 8 major subfields in finance, including corporate finance, asset management, and financial engineering. Some of the QA pairs are written in Chinese or French, while the majority are in English. These questions are presented in a mixed format combining text and heterogeneous image types, such as charts, tables, and diagrams. We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models. Even advanced systems like GPT-4o and Claude-3.5-Sonnet achieve only 42% accuracy. Additionally, the open-source Qwen2-VL lags notably behind its proprietary counterparts. Lastly, we explore GPT o1-style reasoning chains to enhance the models' reasoning capabilities, which significantly improve error correction. Our FAMMA benchmark will facilitate future research to develop expert systems in financial QA. The leaderboard is available at https://famma-bench.github.io/famma/.
Related papers
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding [6.3604109210772934]
KFinEval-Pilot is a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain.
It comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity.
arXiv Detail & Related papers (2025-04-17T00:12:58Z) - MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [50.43793764203352]
We introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions.
arXiv Detail & Related papers (2025-04-08T08:06:53Z) - Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages [22.46637417012878]
We present Uhura, a new benchmark that focuses on two tasks in six typologically diverse African languages. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics.
arXiv Detail & Related papers (2024-12-01T19:46:40Z) - Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models [22.594428755214356]
"Golden Touchstone" is the first comprehensive bilingual benchmark for financial LLMs.
benchmarks include a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities.
We open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning.
arXiv Detail & Related papers (2024-11-09T20:09:11Z) - MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning [42.80085792749683]
We propose MME-Finance, an open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark.
The benchmark is distinguished by its financial focus and expert-level depth, including charts constructed to reflect the actual usage needs of users.
In addition, we propose a Chinese version, which helps compare performance of MLLMs under a Chinese context.
arXiv Detail & Related papers (2024-11-05T18:59:51Z) - MQA-KEAL: Multi-hop Question Answering under Knowledge Editing for Arabic Language [7.488965571323756]
We propose Multi-hop Question Answering under Knowledge Editing for Arabic Language (MQA-KEAL).
MQA-KEAL stores knowledge edits as structured knowledge units in an external memory.
We also contribute MQUAKE-AR (Arabic translation of English benchmark MQUAKE) as well as a new benchmark MQA-AEVAL for rigorous performance evaluation of MQA under KE for Arabic language.
arXiv Detail & Related papers (2024-09-18T18:40:02Z) - Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data.
We introduce Open-FinLLMs, a series of financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data.
We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z) - CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) under Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary and open-source models.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation [50.061029816288936]
We present SciFIBench, a scientific figure interpretation benchmark.
Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories.
The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control.
We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark.
arXiv Detail & Related papers (2024-05-14T17:54:17Z) - FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models [37.34801677290571]
FanOutQA is a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions grounded in English Wikipedia.
We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B.
arXiv Detail & Related papers (2024-02-21T20:30:45Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - FinanceBench: A New Benchmark for Financial Question Answering [28.865821741574237]
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA).
It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings.
We test 16 state-of-the-art model configurations on a sample of 150 cases from FinanceBench and manually review their answers.
arXiv Detail & Related papers (2023-11-20T17:28:02Z) - DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning [74.99318727786337]
We propose a Multiple Experts Fine-tuning Framework to build a financial large language model (LLM).
We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing, and retrieval-augmented generation).
Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios.
arXiv Detail & Related papers (2023-10-23T11:33:41Z) - FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [25.137098233579255]
FinEval is a benchmark for financial domain knowledge in large language models (LLMs).
FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts.
The results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z) - GPT-3 Models are Few-Shot Financial Reasoners [1.0742675209112622]
It is unknown how well pre-trained language models can reason in the financial domain.
We run several experiments with GPT-3 and find that a separate retrieval model and logic engine continue to be essential components.
With this understanding, our refined prompt-engineering approach on GPT-3 achieves near SOTA accuracy without any fine-tuning.
arXiv Detail & Related papers (2023-07-25T16:21:07Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework that includes the first financial large language model (LLM), built by fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z) - Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z) - MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z) - Reinforced Multi-task Approach for Multi-hop Question Generation [47.15108724294234]
We address multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context.
We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator.
We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA.
arXiv Detail & Related papers (2020-04-05T10:16:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.