HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
- URL: http://arxiv.org/abs/2410.06777v1
- Date: Wed, 9 Oct 2024 11:14:07 GMT
- Title: HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
- Authors: Keliang Li, Zaifei Yang, Jiahe Zhao, Hongze Shen, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
- Abstract summary: We introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs.
Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios.
We present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLMs' training.
- Score: 68.4046326104724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The significant advancements in visual understanding and instruction following from Multimodal Large Language Models (MLLMs) have opened up more possibilities for broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise modality alignment and integration of multi-grained information, which is crucial for human-centric visual understanding. In this paper, we introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs. Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios. To address these challenges, we present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLMs' training. Furthermore, we develop HERM-7B, an MLLM that leverages enhanced training data from HERM-100K. Evaluations on HERM-Bench demonstrate that HERM-7B significantly outperforms existing MLLMs across various human-centric dimensions, reflecting the current inadequacy of the data annotations used in MLLM training for human-centric visual understanding. This research emphasizes the importance of specialized datasets and benchmarks in advancing MLLMs' capabilities for human-centric understanding.
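As a rough illustration of the kind of per-dimension evaluation the abstract describes, the sketch below scores a model's answers against benchmark annotations and reports accuracy for each human-centric dimension. It is a minimal, hypothetical harness: the JSON file layout, the `dimension` field, and the `model.answer` call are assumptions made for illustration, not the released HERM-Bench format or API.

```python
import json
from collections import defaultdict

def evaluate_by_dimension(bench_path, model):
    """Score a model on a benchmark and report accuracy per human-centric dimension.

    Assumes each benchmark item is a JSON object with 'image', 'question',
    'answer', and 'dimension' fields -- a hypothetical layout, not the
    official HERM-Bench schema.
    """
    with open(bench_path) as f:
        items = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Assumed model interface: returns a free-form text answer for an image/question pair.
        pred = model.answer(item["image"], item["question"])
        total[item["dimension"]] += 1
        if pred.strip().lower() == item["answer"].strip().lower():
            correct[item["dimension"]] += 1

    # Accuracy per dimension, e.g. {"attribute": 0.71, "relation": 0.55, ...}
    return {dim: correct[dim] / total[dim] for dim in total}
```

Exact string matching is a crude scoring rule; benchmark papers typically rely on multiple-choice matching or an LLM-based judge, so treat this only as a skeleton for per-dimension bookkeeping.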
Related papers
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Model (MLLM) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- A Survey on Evaluation of Multimodal Large Language Models [11.572066870077888]
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs).
This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards achieving artificial general intelligence (AGI).
With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions.
arXiv Detail & Related papers (2024-08-28T13:05:55Z)
- Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias in VidQA benchmarks and datasets.
We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.
Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically reviews the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- MMRel: A Relation Understanding Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a benchmark that features large-scale, high-quality, and diverse data on inter-object relations.
MMRel is ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability.
arXiv Detail & Related papers (2024-06-13T13:51:59Z)
- MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark [41.68821233828375]
This paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.
Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
arXiv Detail & Related papers (2024-02-07T12:28:32Z)
- Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained on top of large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
- Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations, such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of alignment technologies for addressing these limitations.
arXiv Detail & Related papers (2023-07-24T17:44:58Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
The comprehensive dataset and benchmark cover a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)