Related papers: FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models

FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models

URL: http://arxiv.org/abs/2503.09158v2
Date: Thu, 13 Mar 2025 10:45:03 GMT
Title: FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models
Authors: Fufangchen Zhao, Ming Li, Linrui Xu, Wenhao Jiang, Jian Gao, Danfeng Yan,
Abstract summary: FaVChat is the first VMLLM specifically designed for fine-grained facial video understanding.<n>We construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes.<n>We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception.
Score: 12.029771909598647
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-based multimodal large language models (VMLLMs) have demonstrated remarkable potential in cross-modal video understanding. However, their abilities in fine-grained face comprehension remain largely underexplored. Given its pivotal role in human-centric intelligence, developing VMLLMs for facial understanding holds a fundamental problem. To address this gap, we propose FaVChat, the first VMLLM specifically designed for fine-grained facial video understanding. To facilitate its training, we construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes. These attributes are incorporated to enrich GPT-4o-generated captions, yielding 60k high-quality video-summary pairs and an additional 170k fine-grained question-answering (QA) pairs. To effectively capture rich facial clues, we propose a hybrid model architecture composed of a general visual encoder, a dedicated facial encoder, and a mixture-of-experts-enhanced adapter for adaptive fusion of multi-source visual features. To mitigate information loss during feature transformation, we extract multi-granularity representations from the facial encoder and integrate them into the subsequent LLM. This design enhances the model's ability to comprehend and respond to questions involving diverse levels of visual details. We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception. We conduct extensive zero-shot evaluation on a couple of public benchmarks, demonstrating that FaVChat consistently surpasses existing VMLLMs across multiple tasks.

Related papers

UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models [35.952441992916235]
We introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities.<n>We design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model.<n>We construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o.
arXiv Detail & Related papers (2025-12-12T07:17:42Z)
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models [9.896951371033229]
VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding.<n>We construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames.<n> Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks.
arXiv Detail & Related papers (2025-11-24T06:57:26Z)
Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance [34.134289344567705]
We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection.<n>To address the distribution gap between short video content and the original pretraining data of MLLMs, we introduce three targeted pretraining tasks.<n> Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning settings.
arXiv Detail & Related papers (2025-09-25T19:46:34Z)
Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing.<n>Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders.<n>To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements.
arXiv Detail & Related papers (2025-07-08T16:02:16Z)
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging Multimodal Large Language Model (MLLM) Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z)
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness [6.634133253472436]
This paper introduces a new instruction-following dataset tailored for dynamic facial expression caption.<n>The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens.<n>We also present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task.
arXiv Detail & Related papers (2025-01-14T09:52:56Z)
VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models [19.215440092652507]
Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding.<n>It remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception.<n>We propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments.<n>By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty.
arXiv Detail & Related papers (2024-11-14T00:26:26Z)
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos. Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z)
LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence. We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short to comprehend context involving multiple images. We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench. We first introduce a novel static-to-dynamic method to define these temporal-related tasks. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video reasonings. We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video. Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding. To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.