Face-MLLM: A Large Face Perception Model
- URL: http://arxiv.org/abs/2410.20717v1
- Date: Mon, 28 Oct 2024 04:19:32 GMT
- Title: Face-MLLM: A Large Face Perception Model
- Authors: Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, Shiguang Shan,
- Abstract summary: multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, but their ability to perceive and understand human faces is rarely explored.
In this work, we comprehensively evaluate existing MLLMs on face perception tasks.
Our model surpasses previous MLLMs on five famous face perception tasks.
- Score: 53.9441375205716
- License:
- Abstract: Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces is rarely explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate LAION-Face dataset with more detailed face captions and facial attribute labels. Besides, we re-formulate traditional face datasets using the question-answer style, which is fit for MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five famous face perception tasks. Besides, on our newly introduced zero-shot facial attribute analysis task, our Face-MLLM also presents superior performance.
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning [27.790079451103065]
We propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information.
EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets.
arXiv Detail & Related papers (2024-08-21T08:28:40Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM)
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer [40.47880613758304]
We propose a novel method, Exp-CLIP, to enhance zero-shot FER by transferring the task knowledge from large language models (LLMs)
Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions.
Given unlabelled facial data, Exp-CLIP achieves superior zero-shot results to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets.
arXiv Detail & Related papers (2024-05-29T14:06:09Z) - Facial Affective Behavior Analysis with Instruction Tuning [58.332959295770614]
Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images.
Traditional approaches primarily deploy models to discriminate among discrete emotion categories, and lack the fine granularity and reasoning capability for complex facial behaviors.
We introduce an instruction-following dataset for two FABA tasks, emotion and action unit recognition, and a benchmark FABA-Bench with a new metric considering both recognition and generation ability.
We also introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into pre-trained MLLM.
arXiv Detail & Related papers (2024-04-07T19:23:28Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.