R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?
- URL: http://arxiv.org/abs/2410.05474v1
- Date: Mon, 7 Oct 2024 20:12:08 GMT
- Title: R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?
- Authors: Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai
- Abstract summary: R-Bench is a benchmark focused on the **Real-world Robustness of LMMs**.
We show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images.
We hope that R-Bench will inspire improvements to the robustness of LMMs, **extending them from experimental simulations to real-world applications**.
- Score: 86.94616033250068
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The outstanding performance of Large Multimodal Models (LMMs) has made them widely applied in vision-related tasks. However, various corruptions in the real world mean that images will not be as ideal as in simulations, presenting significant challenges for the practical application of LMMs. To address this issue, we introduce R-Bench, a benchmark focused on the **Real-world Robustness of LMMs**. Specifically, we: (a) model the complete link from user capture to LMM reception, comprising 33 corruption dimensions, including 7 steps according to the corruption sequence and 7 groups based on low-level attributes; (b) collect a reference/distorted image dataset before/after corruption, including 2,970 human-labeled question-answer pairs; (c) propose a comprehensive evaluation of absolute/relative robustness and benchmark 20 mainstream LMMs. Results show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images, and there is a significant gap in robustness compared to the human visual system. We hope that R-Bench will inspire improvements to the robustness of LMMs, **extending them from experimental simulations to real-world applications**. Check https://q-future.github.io/R-Bench for details.
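The absolute/relative robustness evaluation can be pictured with a short sketch. The snippet below is a minimal illustration, not the released R-Bench code: absolute robustness is taken here as accuracy on the distorted images, relative robustness as the fraction of reference-correct answers that remain correct after corruption, and all names (`model.answer`, the sample fields) are hypothetical placeholders.

```python
# Minimal sketch of an absolute/relative robustness evaluation in the spirit of
# R-Bench. All names (model.answer, the sample fields) are hypothetical; the
# exact scoring protocol is defined by the paper and its released benchmark.

def evaluate_robustness(model, samples):
    """samples: list of dicts with 'reference', 'distorted', 'question', 'answer'."""
    ref_correct = dist_correct = both_correct = 0
    for s in samples:
        ref_ok = model.answer(s["reference"], s["question"]) == s["answer"]
        dist_ok = model.answer(s["distorted"], s["question"]) == s["answer"]
        ref_correct += ref_ok
        dist_correct += dist_ok
        both_correct += ref_ok and dist_ok

    n = len(samples)
    absolute = dist_correct / n                    # accuracy on corrupted inputs
    relative = both_correct / max(ref_correct, 1)  # consistency w.r.t. reference answers
    return {"absolute_robustness": absolute, "relative_robustness": relative}
```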
Related papers
- MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective [32.55432949789787]
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
We propose a straightforward automated evaluation pipeline that requires LMMs to generate an image-prompt from a given input image.
We then employ text-to-image generative models to create a new image based on these generated prompts.
Finally, we evaluate the performance of LMMs by comparing the original image with the generated one.
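A rough sketch of this image-to-prompt-to-image loop is given below; `lmm.describe`, `t2i_model.generate`, and `image_similarity` are placeholder names, not the MMGenBench API.

```python
# Hypothetical sketch of the MMGenBench-style evaluation loop described above.
# `lmm`, `t2i_model`, and `image_similarity` stand in for whatever LMM,
# text-to-image model, and image-level metric one chooses.

def mmgenbench_style_score(lmm, t2i_model, image_similarity, images):
    scores = []
    for image in images:
        prompt = lmm.describe(image)                # LMM writes an image-prompt for the input image
        regenerated = t2i_model.generate(prompt)    # text-to-image model redraws it from that prompt
        scores.append(image_similarity(image, regenerated))  # compare original vs. regenerated
    return sum(scores) / len(scores)
```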
arXiv Detail & Related papers (2024-11-21T12:16:16Z)
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [25.959032350818795]
HumanEval-V is a benchmark designed to evaluate Large Multimodal Models' visual understanding and reasoning capabilities through code generation.
HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow.
We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges.
arXiv Detail & Related papers (2024-10-16T09:04:57Z)
- Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models [57.280853324896306]
Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively.
We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K and 8K images.
We propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images.
arXiv Detail & Related papers (2024-08-28T06:09:02Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents.
Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments.
VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
arXiv Detail & Related papers (2024-08-12T17:44:17Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM, an approach for grounding frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
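The core idea of mapping frozen word-pixel attention maps to mask logits with a small CNN can be sketched as follows; channel counts, kernel sizes, and the patch-grid resolution are illustrative assumptions, not the F-LMM architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a tiny CNN that maps stacked word-pixel attention
# maps (one channel per attention head/layer) to per-pixel mask logits, in the
# spirit of F-LMM. Channel counts and kernel sizes are made up for the example.

class AttnToMask(nn.Module):
    def __init__(self, in_channels: int = 32, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # single-channel mask logits
        )

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: (batch, in_channels, H, W) word-to-pixel attention weights
        return self.net(attn_maps)  # (batch, 1, H, W) mask logits


mask_logits = AttnToMask()(torch.rand(2, 32, 24, 24))  # e.g. a 24x24 patch grid
```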
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- Benchmarking Large Multimodal Models against Common Corruptions [45.26424202601339]
This technical report aims to fill a deficiency in the assessment of large multimodal models (LMMs): their behavior under common corruptions.
We investigate the cross-modal interactions between text, image, and speech, encompassing four essential generation tasks.
We create a benchmark, named MMCBench, that covers more than 100 popular LMMs.
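A common way to operationalize such a corruption benchmark is a self-consistency score between outputs on clean and corrupted inputs; the snippet below is a generic illustration of that idea with placeholder names (`lmm.caption`, `corrupt`, `text_similarity`), not MMCBench's actual protocol.

```python
# Generic self-consistency check under corruption, in the spirit of benchmarks
# such as MMCBench. `lmm.caption`, `corrupt`, and `text_similarity` are
# placeholder names, not the benchmark's actual API.

def consistency_under_corruption(lmm, corrupt, text_similarity, images):
    scores = []
    for image in images:
        clean_output = lmm.caption(image)               # output on the clean input
        corrupted_output = lmm.caption(corrupt(image))  # output on the corrupted input
        scores.append(text_similarity(clean_output, corrupted_output))
    return sum(scores) / len(scores)  # higher = more consistent (more robust)
```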
arXiv Detail & Related papers (2024-01-22T13:33:53Z)
- Lightweight high-resolution Subject Matting in the Real World [43.56357473163735]
We construct a salient object matting dataset, HRSOM, and a lightweight network, PSUNet.
Considering efficient inference on mobile deployment frameworks, we design a symmetric pixel shuffle module and a lightweight module, TRSU.
Compared with 13 SOD methods, the proposed PSUNet achieves the best objective performance on the high-resolution benchmark dataset.
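PyTorch's built-in `PixelUnshuffle`/`PixelShuffle` pair is the standard way to trade spatial resolution for channels and back without losing information, which is presumably what a symmetric pixel shuffle module builds on; the sketch below only demonstrates that symmetry, not the actual PSUNet module.

```python
import torch
import torch.nn as nn

# Symmetric pixel (un)shuffle: downscale by folding pixels into channels, then
# restore the original resolution. A generic illustration of the idea, not the
# PSUNet implementation.

scale = 2
down = nn.PixelUnshuffle(scale)   # (B, C, H, W) -> (B, C*scale^2, H/scale, W/scale)
up = nn.PixelShuffle(scale)       # inverse mapping

x = torch.rand(1, 3, 256, 256)    # a "high-resolution" input
y = down(x)                       # torch.Size([1, 12, 128, 128])
assert torch.equal(up(y), x)      # the pair is exactly invertible
```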
arXiv Detail & Related papers (2023-12-12T09:27:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.