MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
- URL: http://arxiv.org/abs/2410.10563v3
- Date: Sun, 13 Jul 2025 23:07:08 GMT
- Title: MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
- Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
- Abstract summary: MEGA-Bench is an evaluation suite that scales multimodal evaluation to over 500 real-world tasks. We collected 505 tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space.
- Score: 49.59854479079552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
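To make the format-dependent scoring concrete, below is a minimal, hypothetical sketch of how a benchmark of this kind might route each model response to a format-specific metric and then average scores per capability dimension. The scorer names, sample fields, and aggregation scheme are illustrative assumptions, not MEGA-Bench's released code.

```python
# Hypothetical sketch: per-format scoring plus a per-dimension capability report.
# Field names ("output_format", "dimensions", ...) are assumptions for illustration.
import json
from collections import defaultdict
from statistics import mean


def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())


def json_match(pred: str, ref: str) -> float:
    # Compare parsed structures; any parse failure scores 0.
    try:
        return float(json.loads(pred) == json.loads(ref))
    except (json.JSONDecodeError, TypeError):
        return 0.0


def number_match(pred: str, ref: str, tol: float = 1e-3) -> float:
    try:
        return float(abs(float(pred) - float(ref)) <= tol)
    except ValueError:
        return 0.0


# One scorer per output format; a full suite would register many more
# (coordinates, LaTeX, code execution, free-form LLM judging, ...).
SCORERS = {"phrase": exact_match, "json": json_match, "number": number_match}


def evaluate(samples: list[dict]) -> dict[str, float]:
    """Score each sample with its format's metric and report the mean score
    per capability dimension tag (e.g., application, input type, skill)."""
    per_dimension = defaultdict(list)
    for s in samples:
        score = SCORERS[s["output_format"]](s["prediction"], s["reference"])
        for tag in s["dimensions"]:  # e.g., ["application:math", "input:chart"]
            per_dimension[tag].append(score)
    return {tag: mean(scores) for tag, scores in per_dimension.items()}


if __name__ == "__main__":
    demo = [
        {"output_format": "number", "prediction": "42.0", "reference": "42",
         "dimensions": ["application:math", "input:chart"]},
        {"output_format": "json", "prediction": '{"a": 1}', "reference": '{"a": 1}',
         "dimensions": ["application:coding", "input:screenshot"]},
    ]
    print(evaluate(demo))
```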
Related papers
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents [36.11725924594441]
Current online form filling tools are largely rule-based and lack generalizable, generative capabilities. We propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions.
arXiv Detail & Related papers (2025-06-02T10:34:57Z)
- OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks [77.19223035769248]
Recent breakthroughs in large multimodal models (LMMs) have demonstrated remarkable proficiency in following general-purpose instructions for image generation. We introduce OmniGenBench, a novel benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand.
arXiv Detail & Related papers (2025-05-24T16:29:34Z)
- Judge Anything: MLLM as a Judge Across Any Modality [43.51517213949702]
This paper introduces two benchmarks, TaskAnything and JudgeAnything, to evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks.
TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks.
JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation.
Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences.
arXiv Detail & Related papers (2025-03-21T18:59:20Z)
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB).
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents.
We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench.
MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z)
- MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation [19.745059794932807]
We introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task.
We aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus.
To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics.
arXiv Detail & Related papers (2025-02-06T16:07:24Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI [71.53579367538725]
MMT-Bench is a benchmark designed to assess Large Vision-Language Models (LVLMs) across massive multimodal tasks.
MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios.
arXiv Detail & Related papers (2024-04-24T17:37:05Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Multimodal Question Answering for Unified Information Extraction [15.798187192290746]
Multimodal information extraction aims to extract structured information from unstructured multimedia content.
Most current MIE models are task-specific and data-intensive.
We propose a novel multimodal question answering (MQA) framework to unify three MIE tasks.
arXiv Detail & Related papers (2023-10-04T17:58:05Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information of multiple modalities including image, text, table, video and audio.
arXiv Detail & Related papers (2021-09-09T13:50:22Z)