FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
- URL: http://arxiv.org/abs/2410.04526v2
- Date: Tue, 8 Oct 2024 05:06:05 GMT
- Title: FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
- Authors: Siqiao Xue, Tingting Chen, Fan Zhou, Qingyang Dai, Zhixuan Chu, Hongyuan Mei,
- Abstract summary: FAMMA is an open-source benchmark for financial multilingual multimodal question answering.
It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams.
We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models.
- Score: 22.245216871611678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of multimodal large language models (MLLMs) in answering questions that require advanced financial knowledge and sophisticated reasoning. It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams, spanning 8 major subfields in finance including corporate finance, asset management, and financial engineering. Some of the QA pairs are written in Chinese or French, while a majority of them are in English. These questions are presented in a mixed format combining text and heterogeneous image types, such as charts, tables, and diagrams. We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models. Even advanced systems like GPT-4o and Claude-35-Sonnet achieve only 42% accuracy. Additionally, the open-source Qwen2-VL lags notably behind its proprietary counterparts. Lastly, we explore GPT o1-style reasoning chains to enhance the models' reasoning capabilities, which significantly improve error correction. Our FAMMA benchmark will facilitate future research to develop expert systems in financial QA. The leaderboard is available at https://famma-bench.github.io/famma/ .
Related papers
- Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models [22.594428755214356]
"Golden Touchstone" is the first comprehensive bilingual benchmark for financial LLMs.
The benchmarks include a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities.
We open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning.
arXiv Detail & Related papers (2024-11-09T20:09:11Z)
- MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning [42.80085792749683]
We propose MME-Finance, an open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark.
Our benchmark is characterized by its finance focus and expert-level design, which includes constructing charts that reflect the actual usage needs of users.
In addition, we propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.
arXiv Detail & Related papers (2024-11-05T18:59:51Z)
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data.
We introduce Open-FinLLMs, a series of Financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data.
We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
- CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
- SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation [50.061029816288936]
We present SciFIBench, a scientific figure interpretation benchmark.
Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories.
The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control.
We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark.
arXiv Detail & Related papers (2024-05-14T17:54:17Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- FinanceBench: A New Benchmark for Financial Question Answering [28.865821741574237]
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA).
It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings.
We test 16 state-of-the-art model configurations on a sample of 150 cases from FinanceBench, and manually review their answers.
arXiv Detail & Related papers (2023-11-20T17:28:02Z)
- DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning [74.99318727786337]
We propose a Multiple Experts Fine-tuning Framework to build a financial large language model (LLM).
We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing, and retrieval-augmented generation).
Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios.
arXiv Detail & Related papers (2023-10-23T11:33:41Z)
- FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [25.137098233579255]
FinEval is a benchmark for evaluating the financial domain knowledge of large language models (LLMs).
FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts.
The results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z)
- GPT-3 Models are Few-Shot Financial Reasoners [1.0742675209112622]
It is unknown how well pre-trained language models can reason in the financial domain.
We run several experiments with GPT-3 and find that a separate retrieval model and logic engine continue to be essential components.
With this understanding, our refined prompt-engineering approach on GPT-3 achieves near SOTA accuracy without any fine-tuning.
arXiv Detail & Related papers (2023-07-25T16:21:07Z)