Face-MLLM: A Large Face Perception Model
- URL: http://arxiv.org/abs/2410.20717v1
- Date: Mon, 28 Oct 2024 04:19:32 GMT
- Title: Face-MLLM: A Large Face Perception Model
- Authors: Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, Shiguang Shan,
- Abstract summary: multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, but their ability to perceive and understand human faces is rarely explored.
In this work, we comprehensively evaluate existing MLLMs on face perception tasks.
Our model surpasses previous MLLMs on five famous face perception tasks.
- Score: 53.9441375205716
- License:
- Abstract: Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces is rarely explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate LAION-Face dataset with more detailed face captions and facial attribute labels. Besides, we re-formulate traditional face datasets using the question-answer style, which is fit for MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five famous face perception tasks. Besides, on our newly introduced zero-shot facial attribute analysis task, our Face-MLLM also presents superior performance.
Related papers
- FaceXBench: Evaluating Multimodal LLMs on Face Understanding [30.86305376082235]
We introduce FaceXBench, a benchmark designed to evaluate MLLMs on complex face understanding tasks.
FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI.
We conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks.
arXiv Detail & Related papers (2025-01-17T18:59:55Z) - Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness [6.634133253472436]
This paper introduces a new instruction-following dataset tailored for dynamic facial expression caption.
The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens.
We also present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task.
arXiv Detail & Related papers (2025-01-14T09:52:56Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning [27.790079451103065]
We propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information.
EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets.
arXiv Detail & Related papers (2024-08-21T08:28:40Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM)
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - Facial Affective Behavior Analysis with Instruction Tuning [58.332959295770614]
Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images.
Traditional approaches primarily deploy models to discriminate among discrete emotion categories, and lack the fine granularity and reasoning capability for complex facial behaviors.
We introduce an instruction-following dataset for two FABA tasks, emotion and action unit recognition, and a benchmark FABA-Bench with a new metric considering both recognition and generation ability.
We also introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into pre-trained MLLM.
arXiv Detail & Related papers (2024-04-07T19:23:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.