Benchmarking Large Multimodal Models against Common Corruptions
- URL: http://arxiv.org/abs/2401.11943v1
- Date: Mon, 22 Jan 2024 13:33:53 GMT
- Title: Benchmarking Large Multimodal Models against Common Corruptions
- Authors: Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin
- Abstract summary: This technical report aims to fill a deficiency in the assessment of large multimodal models (LMMs).
We investigate the cross-modal interactions between text, image, and speech, encompassing four essential generation tasks.
We create a benchmark, named MMCBench, that covers more than 100 popular LMMs.
- Score: 45.26424202601339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report aims to fill a deficiency in the assessment of large
multimodal models (LMMs) by specifically examining the self-consistency of
their outputs when subjected to common corruptions. We investigate the
cross-modal interactions between text, image, and speech, encompassing four
essential generation tasks: text-to-image, image-to-text, text-to-speech, and
speech-to-text. We create a comprehensive benchmark, named MMCBench, that
covers more than 100 popular LMMs (over 150 model checkpoints in total). A
thorough evaluation under common corruptions is critical for practical
deployment and facilitates a better understanding of the reliability of
cutting-edge LMMs. The benchmarking code is available at
https://github.com/sail-sg/MMCBench
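The abstract frames self-consistency under common corruptions as the core measurement but gives no implementation detail here, so the following is a minimal sketch of that idea for one direction (image-to-text), assuming hypothetical `caption` and `text_similarity` placeholders rather than the actual MMCBench API, with additive Gaussian noise standing in for one common corruption.
```python
# Minimal sketch (not the MMCBench implementation): estimate self-consistency of an
# image-to-text model by comparing its output on a clean image with its output on a
# corrupted copy of the same image. `caption` and `text_similarity` are hypothetical
# placeholders for the model call and the semantic-similarity metric.
import numpy as np


def gaussian_noise(image: np.ndarray, severity: float = 0.1) -> np.ndarray:
    """One example of a common corruption: additive Gaussian noise on a uint8 image."""
    noisy = image.astype(np.float32) / 255.0 + np.random.normal(0.0, severity, image.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)


def caption(image: np.ndarray) -> str:
    """Placeholder for an image-to-text LMM call (assumption, not a real API)."""
    raise NotImplementedError


def text_similarity(a: str, b: str) -> float:
    """Placeholder for a semantic similarity score in [0, 1], e.g. embedding cosine."""
    raise NotImplementedError


def self_consistency(images: list[np.ndarray]) -> float:
    """Average similarity between outputs on clean and corrupted versions of each input."""
    scores = [text_similarity(caption(img), caption(gaussian_noise(img))) for img in images]
    return float(np.mean(scores))
```
The same pattern applies to the other three directions (text-to-image, text-to-speech, speech-to-text) by swapping in a corruption and an output-similarity metric appropriate to the relevant modality.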
Related papers
- CAMEL-Bench: A Comprehensive Arabic LMM Benchmark [10.20074702234283]
We develop a comprehensive LMM evaluation benchmark for the Arabic language, representing a population of over 400 million speakers.
The proposed benchmark comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding.
arXiv Detail & Related papers (2024-10-24T17:59:38Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions? [86.94616033250068]
R-Bench is a benchmark focused on the real-world robustness of LMMs.
We show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images.
We hope that R-Bench will inspire work on improving the robustness of LMMs, extending them from experimental simulations to real-world applications.
arXiv Detail & Related papers (2024-10-07T20:12:08Z) - MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z) - mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval [67.50604814528553]
We first introduce a text encoder enhanced with RoPE and unpadding, pre-trained with a native 8192-token context length.
We then construct a hybrid text representation model (TRM) and a cross-encoder reranker via contrastive learning.
arXiv Detail & Related papers (2024-07-29T03:12:28Z) - LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z) - Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first generate a scene graph (SG) using the LMM and then use that SG in the prompt to produce a response, as sketched below.
We find that the proposed CCoT approach improves LMM performance not only on compositional benchmarks but also on general multimodal benchmarks.
arXiv Detail & Related papers (2023-11-27T22:23:27Z)
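As referenced above, here is a minimal sketch of the CCoT two-step prompting flow under stated assumptions: `query_lmm` is a hypothetical wrapper around whatever multimodal model is used, and the prompt wording is illustrative rather than the authors' exact prompt.
```python
# Minimal sketch of the CCoT idea (not the authors' implementation): first prompt the
# LMM for a scene graph of the image, then include that scene graph as context when
# answering the actual question. `query_lmm` is a hypothetical placeholder.


def query_lmm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a large multimodal model (assumption, not a real API)."""
    raise NotImplementedError


def ccot_answer(image_path: str, question: str) -> str:
    # Step 1: ask the model to describe the image as a scene graph covering objects,
    # their attributes, and the relationships between them.
    sg_prompt = (
        "For the provided image, generate a scene graph in JSON format that includes "
        "the objects in the image, their attributes, and the relationships between them."
    )
    scene_graph = query_lmm(image_path, sg_prompt)

    # Step 2: feed the generated scene graph back in as chain-of-thought context.
    answer_prompt = (
        f"Scene graph:\n{scene_graph}\n\n"
        "Use the image and the scene graph above as context to answer the following "
        f"question:\n{question}"
    )
    return query_lmm(image_path, answer_prompt)
```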
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.