InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models
- URL: http://arxiv.org/abs/2311.11567v3
- Date: Mon, 4 Dec 2023 20:55:53 GMT
- Title: InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models
- Authors: Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng,
Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang,
Hongxia Yang
- Abstract summary: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
- Score: 50.03163753638256
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the
field of artificial intelligence. These models not only excel in traditional
vision-language tasks but also demonstrate impressive performance in
contemporary multi-modal benchmarks. Although many of these benchmarks attempt
to holistically evaluate MLLMs, they typically concentrate on basic reasoning
tasks, often yielding only simple yes/no or multiple-choice responses. Such
response formats make it difficult to conclusively determine the reasoning
capabilities of MLLMs. To mitigate this issue, we
manually curate a benchmark dataset specifically designed for MLLMs, with a
focus on complex reasoning tasks. Our benchmark comprises three key reasoning
categories: deductive, abductive, and analogical reasoning. The queries in our
dataset are intentionally constructed to engage the reasoning capabilities of
MLLMs in the process of generating answers. For a fair comparison across
various MLLMs, we incorporate intermediate reasoning steps into our evaluation
criteria. In instances where an MLLM is unable to produce a definitive answer,
its reasoning ability is evaluated by requesting intermediate reasoning steps.
If these steps align with our manual annotations, appropriate scores are
assigned. This evaluation scheme resembles methods commonly used in human
assessments, such as exams or assignments, and represents what we consider a
more effective assessment technique compared with existing benchmarks. We
evaluate a selection of representative MLLMs using this rigorously developed
open-ended multi-step elaborate reasoning benchmark, designed to challenge and
accurately measure their reasoning capabilities. The code and data will be
released at https://infimm.github.io/InfiMM-Eval/
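To illustrate the kind of step-aware scoring the abstract describes, below is a minimal Python sketch that grants full credit for a correct final answer and partial credit for intermediate reasoning steps that align with reference annotations. The matching heuristic, threshold, weights, and function names (`step_matches`, `score_response`) are illustrative assumptions, not the authors' actual protocol or released code.
```python
# Minimal sketch (not the authors' released code) of a step-aware scoring rule
# in the spirit of the abstract: full credit for a correct final answer and
# partial credit for intermediate reasoning steps that align with manual
# annotations. The matching heuristic, threshold, and weights are assumptions.
from difflib import SequenceMatcher


def step_matches(predicted: str, annotated: str, threshold: float = 0.7) -> bool:
    """Crude textual check that a predicted step aligns with an annotated step."""
    return SequenceMatcher(None, predicted.lower(), annotated.lower()).ratio() >= threshold


def score_response(final_answer: str,
                   reference_answer: str,
                   predicted_steps: list[str],
                   annotated_steps: list[str],
                   answer_weight: float = 0.5) -> float:
    """Combine answer correctness and step alignment into a score in [0, 1]."""
    answer_score = 1.0 if final_answer.strip().lower() == reference_answer.strip().lower() else 0.0
    if annotated_steps:
        # Count annotated steps that are recovered by at least one predicted step.
        matched = sum(
            any(step_matches(p, a) for p in predicted_steps)
            for a in annotated_steps
        )
        step_score = matched / len(annotated_steps)
    else:
        step_score = answer_score
    return answer_weight * answer_score + (1.0 - answer_weight) * step_score


# Example: the model misses the final answer, but its reasoning steps still
# earn partial credit for every annotated step they align with.
print(score_response(
    final_answer="the red car",
    reference_answer="the blue car",
    predicted_steps=[
        "identify the two cars in the scene",
        "note that one car is parked near the entrance",
    ],
    annotated_steps=[
        "identify the two cars in the scene",
        "note which car is parked near the entrance",
        "read the sign to infer which car belongs to the manager",
    ],
))
```
In practice, aligning predicted steps with annotations likely requires richer judging than string similarity (e.g., an LLM judge or human graders), so this sketch only conveys the overall shape of the scoring scheme.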
Related papers
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skills.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [68.83624133567213]
We show that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question.
We also propose a simple yet effective method, Active Deduction (AD), to encourage the model to actively perform composite deduction.
arXiv Detail & Related papers (2024-04-19T15:53:27Z)
- NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs.
Our findings reveal significant discrepancies in reasoning abilities across different models.
We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)
- Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data.
Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, suggesting data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm that assesses MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)