SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for
Spotting Dynamic Facial Expressions in Long Videos
- URL: http://arxiv.org/abs/2209.08445v1
- Date: Sun, 18 Sep 2022 01:59:12 GMT
- Authors: Xiaolin Xu, Yuan Zong, Wenming Zheng, Yang Li, Chuangao Tang, Xingxun
Jiang, Haolin Jiang
- Abstract summary: SDFE-LV consists of 1,191 long videos, each of which contains one or more complete dynamic facial expressions.
Each complete dynamic facial expression in its corresponding long video was independently labeled five times by 10 well-trained annotators.
- Score: 21.7199719907133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a large-scale, multi-source, and
unconstrained database called SDFE-LV for spotting the onset and offset frames
of a complete dynamic facial expression in long videos, a task known as
dynamic facial expression spotting (DFES) and a vital prior step for many
facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long
videos, each of which contains one or more complete dynamic facial
expressions. Moreover, each complete dynamic facial expression in its
corresponding long video was independently labeled five times by 10
well-trained annotators. To the best of our knowledge, SDFE-LV is the first
unconstrained large-scale database for the DFES task whose long videos are
collected from multiple real-world or near-real-world media sources, e.g., TV
interviews, documentaries, movies, and we-media short videos. DFES on SDFE-LV
therefore involves many practical difficulties, such as head pose changes,
occlusions, and illumination variations. We also provide a comprehensive
benchmark evaluation from different angles using many recent state-of-the-art
deep spotting methods, so that researchers interested in DFES can get started
quickly and easily. Finally, through an in-depth discussion of the
experimental results, we point out several promising directions for tackling
DFES and hope the task can be advanced further. SDFE-LV will be freely
released for academic use only as soon as possible.
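
As context for the benchmark evaluation mentioned above, the sketch below shows how DFES predictions are commonly scored: a predicted (onset, offset) interval counts as correct when its temporal IoU with an annotated interval passes a threshold, and an F1 score summarizes precision and recall. This is a minimal illustration assuming a MEGC-style spotting protocol; the function names, the greedy matching, and the 0.5 threshold are assumptions, not the paper's exact evaluation.

```python
# Minimal sketch of interval-based DFES evaluation (assumed protocol, not the
# SDFE-LV paper's exact one): a predicted (onset, offset) interval is a true
# positive when its IoU with an unmatched ground-truth interval is >= 0.5.

from typing import List, Tuple

Interval = Tuple[int, int]  # (onset_frame, offset_frame), inclusive

def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union if union > 0 else 0.0

def spotting_f1(preds: List[Interval], gts: List[Interval],
                thr: float = 0.5) -> float:
    """Greedy one-to-one matching of predictions to ground truths, then F1."""
    matched = set()
    tp = 0
    for p in preds:
        # Match each prediction to the best still-unmatched ground truth.
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            iou = interval_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

# Example: one long video with two annotated expressions and two predictions.
gts = [(120, 180), (400, 470)]
preds = [(118, 176), (500, 560)]
print(spotting_f1(preds, gts))  # 0.5: one hit, one miss, one false alarm
```

Under such a metric, a prediction only counts if both its onset and offset are close to the annotation, which is part of what makes DFES harder than clip-level expression recognition.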
Related papers
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364] (arXiv, 2024-10-03)
  In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
  We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
  Our agent achieves a success rate that surpasses GPT-4o by over 20%.
- Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding [30.06191555110948] (arXiv, 2024-06-14)
  We propose the Short Film Dataset with 1,078 publicly available amateur movies.
  Our experiments emphasize the need for long-term reasoning to solve SFD tasks.
  Current models perform significantly worse than people when using vision data alone.
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001] (arXiv, 2024-05-14)
  We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
  Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs) covering various visual and multimodal aspects.
  We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split.
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426] (arXiv, 2024-04-04)
  We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
  We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
  Our work also presents a new benchmark dataset that contains 1,200 long videos, each with a high-quality summary annotated by professionals.
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375] (arXiv, 2024-01-16)
  This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
  Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
  We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
- SVFAP: Self-supervised Video Facial Affect Perceiver [42.16505961654868] (arXiv, 2023-12-31)
  Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP).
  To address the dilemma faced by supervised methods, SVFAP leverages masked video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos (see the sketch after this list).
  To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks: dynamic facial expression recognition, dimensional emotion recognition, and personality recognition.
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541] (arXiv, 2023-08-07)
  Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
  We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811] (arXiv, 2023-03-28)
  Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
  This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
  Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
- DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild [22.305429904593126] (arXiv, 2020-08-13)
  We present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW, consisting of over 16,000 video clips from thousands of movies.
  We also propose a novel Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild.
  We conduct extensive benchmark experiments on DFEW using many deep feature learning methods as well as the proposed EC-STFL.
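
As referenced in the SVFAP entry above, masked video autoencoding hides most spatiotemporal patches of a video and trains a model to reconstruct them from the visible remainder. The sketch below shows only the random masking step; the 90% mask ratio and the 8x14x14 tube tokenization are illustrative assumptions, not SVFAP's actual configuration.

```python
# Minimal sketch of the masking step in masked video autoencoding (shapes and
# ratio are assumptions for illustration, not SVFAP's actual settings).

import numpy as np

def random_patch_mask(n_patches: int, mask_ratio: float = 0.9,
                      seed: int = 0) -> np.ndarray:
    """Boolean mask over patch tokens; True = masked (hidden from encoder)."""
    rng = np.random.default_rng(seed)
    n_masked = int(n_patches * mask_ratio)
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask

# A 16-frame, 224x224 video tokenized into 8x14x14 = 1568 tube patches.
mask = random_patch_mask(8 * 14 * 14)
visible_tokens = (~mask).sum()  # encoder sees only ~10% of the tokens
print(visible_tokens)           # 157 of 1568 patches remain visible
```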