Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
- URL: http://arxiv.org/abs/2501.07978v1
- Date: Tue, 14 Jan 2025 09:52:56 GMT
- Title: Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
- Authors: Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei
- Abstract summary: This paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning.
The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens.
We also present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task.
- Score: 6.634133253472436
- License:
- Abstract: Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 manually annotated high-quality video clips, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
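The LCS component of the proposed evaluation metric can be illustrated with a short sketch. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that events have already been extracted from the generated and reference captions as ordered lists of labels (the event extraction and relation classification stages mentioned in the abstract are not covered), and it scores temporal sequence consistency as the LCS length normalized by the reference length.

```python
# Minimal sketch of an LCS-based temporal sequence consistency score.
# Assumption: events have already been extracted from the generated and
# reference captions as ordered lists of labels (e.g. "smile", "frown").
# This is an illustration, not the paper's actual metric implementation.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two event sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def temporal_consistency(generated: list[str], reference: list[str]) -> float:
    """LCS length normalized by the reference length (range 0.0 to 1.0)."""
    if not reference:
        return 0.0
    return lcs_length(generated, reference) / len(reference)

# Example: the generated caption misses one event but preserves the order.
ref = ["neutral", "raise eyebrows", "smile", "laugh"]
gen = ["neutral", "smile", "laugh"]
print(temporal_consistency(gen, ref))  # 0.75
```

Because LCS only rewards events that appear in the same relative order, out-of-order descriptions are penalized even when the event set itself is correct, which is what makes it a natural fit for checking temporal sequence consistency.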
Related papers
- SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs).
SCBench is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.
Our results show that InternVL-Chat-2 achieves the best performance with a score of 5.44, surpassing the second-best model by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.
HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Face-MLLM: A Large Face Perception Model [53.9441375205716]
Multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, but their ability to perceive and understand human faces is rarely explored.
In this work, we comprehensively evaluate existing MLLMs on face perception tasks.
Our model surpasses previous MLLMs on five famous face perception tasks.
arXiv Detail & Related papers (2024-10-28T04:19:32Z) - Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok).
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z) - Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.