MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark
- URL: http://arxiv.org/abs/2501.16688v1
- Date: Tue, 28 Jan 2025 03:56:17 GMT
- Title: MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark
- Authors: Dongyi Yi, Guibo Zhu, Chenglin Ding, Zongshu Li, Dong Yi, Jinqiao Wang
- Abstract summary: We introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings.
The benchmark encompasses 21 distinct domains, comprising 1,050 question-answer pairs with 50 questions per domain.
We provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages.
- Score: 20.642661835794975
- License:
- Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), numerous evaluation benchmarks have emerged. However, comprehensive assessments of their performance across diverse industrial applications remain limited. In this paper, we introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings. The benchmark encompasses 21 distinct domains, comprising 1,050 question-answer pairs with 50 questions per domain. To ensure data integrity and prevent potential leakage from public datasets, all question-answer pairs were manually crafted and validated by domain experts. In addition, the benchmark's complexity is effectively enhanced by incorporating non-OCR questions that can be answered directly, along with tasks requiring specialized domain knowledge. Moreover, we provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages. Our findings contribute valuable insights into MLLMs' practical industrial applications and illuminate promising directions for future model optimization research.
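The abstract pins down the benchmark's layout (21 domains × 50 questions = 1,050 bilingual, expert-validated QA pairs), which implies a simple per-domain accuracy protocol. Below is a minimal Python sketch of that layout; the paper does not publish a data schema or scoring script here, so the IndustryQA fields and the predict callable are illustrative assumptions only.

```python
from dataclasses import dataclass

# Hypothetical record layout for one MME-Industry item; field names are
# illustrative, not the authors' released format.
@dataclass
class IndustryQA:
    domain: str       # one of the 21 industrial domains
    image_path: str   # the multimodal (image) input
    question_zh: str  # Chinese version of the question
    question_en: str  # English version of the question
    answer: str       # expert-validated ground-truth answer


def accuracy_by_domain(items: list[IndustryQA], predict) -> dict[str, float]:
    """Per-domain accuracy; with 50 questions per domain, each score
    averages 50 binary outcomes. `predict` is any model wrapper that
    maps an IndustryQA item to an answer string."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        total[item.domain] = total.get(item.domain, 0) + 1
        if predict(item) == item.answer:
            correct[item.domain] = correct.get(item.domain, 0) + 1
    return {d: correct.get(d, 0) / total[d] for d in total}
```

A Chinese-versus-English comparison, as described in the abstract, would amount to running the same loop twice, prompting the model with question_zh in one pass and question_en in the other.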
Related papers
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.
We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
- Redundancy Principles for MLLMs Benchmarks [72.88776679131253]
We focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains.
Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations.
arXiv Detail & Related papers (2025-01-20T08:09:42Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection [66.05200339481115]
We present MMAD, the first-ever full-spectrum MLLM benchmark for industrial anomaly detection.
We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset.
With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-10-12T09:16:09Z)
- A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
- A Survey on Benchmarks of Multimodal Large Language Models [65.87641718350639]
This paper presents a comprehensive review of 200 benchmarks and evaluations for Multimodal Large Language Models (MLLMs).
We focus on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities.
Our key argument is that evaluation should be regarded as a crucial discipline to support the development of MLLMs better.
arXiv Detail & Related papers (2024-08-16T09:52:02Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering [36.31193273252256]
Large Language Models (LLMs) have gained popularity and achieved remarkable results in open-domain tasks.
However, their performance in real industrial domain-specific scenarios is only average due to a lack of specific domain knowledge.
We provide a benchmark Question Answering (QA) dataset named MSQA, centered around Microsoft products and IT technical problems encountered by customers.
arXiv Detail & Related papers (2023-05-19T09:23:25Z)