SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for
Spotting Dynamic Facial Expressions in Long Videos
- URL: http://arxiv.org/abs/2209.08445v1
- Date: Sun, 18 Sep 2022 01:59:12 GMT
- Title: SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for
Spotting Dynamic Facial Expressions in Long Videos
- Authors: Xiaolin Xu, Yuan Zong, Wenming Zheng, Yang Li, Chuangao Tang, Xingxun
Jiang, Haolin Jiang
- Abstract summary: SDFE-LV consists of 1,191 long videos, each of which contains one or more complete dynamic facial expressions.
Each complete dynamic facial expression in its corresponding long video was independently labeled five times by 10 well-trained annotators.
- Score: 21.7199719907133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a large-scale, multi-source, and unconstrained
database called SDFE-LV for spotting the onset and offset frames of a complete
dynamic facial expression in long videos, a task known as dynamic facial
expression spotting (DFES) and a vital prior step for many facial expression
analysis tasks. Specifically, SDFE-LV consists of 1,191 long videos, each of
which contains one or more complete dynamic facial expressions. Moreover, each
complete dynamic facial expression in its corresponding long video was
independently labeled five times by 10 well-trained annotators. To the best of
our knowledge, SDFE-LV is the first unconstrained large-scale database for the
DFES task whose long videos are collected from multiple real-world or
near-real-world media sources, e.g., TV interviews, documentaries, movies, and
we-media short videos. DFES on the SDFE-LV database therefore faces numerous
practical difficulties such as head pose changes, occlusions, and illumination
variations. We also provide a comprehensive benchmark evaluation from different
angles using many recent state-of-the-art deep spotting methods, so that
researchers interested in DFES can get started quickly and easily. Finally,
through in-depth discussion of the experimental results, we point out several
meaningful directions for tackling DFES tasks and hope that DFES research can
be better advanced in the future. In addition, SDFE-LV will be freely released
for academic use only as soon as possible.
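The abstract does not specify SDFE-LV's evaluation protocol, but spotting results are commonly scored by temporal IoU between predicted and ground-truth (onset, offset) frame intervals, with an interval counted as a true positive above some overlap threshold. The sketch below is an illustrative assumption of such a protocol (the function names and the 0.5 threshold are hypothetical, not taken from the paper):

```python
def interval_iou(pred, gt):
    """Temporal IoU between two inclusive (onset, offset) frame intervals."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def spotting_f1(preds, gts, thresh=0.5):
    """F1 score with greedy one-to-one matching of predictions to ground truth.

    A prediction is a true positive if it overlaps an unmatched ground-truth
    interval with IoU >= thresh; each ground-truth interval matches at most once.
    """
    matched = set()
    tp = 0
    for p in preds:
        best, best_iou = None, thresh
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = interval_iou(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For example, a single exact prediction against one ground-truth interval yields F1 = 1.0, while one correct and one spurious prediction yields precision 0.5, recall 1.0, and F1 of 2/3.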
Related papers
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [0.0]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding [30.06191555110948]
We propose the Short Film dataset with 1,078 publicly available amateur movies.
Our experiments emphasize the need for long-term reasoning to solve SFD tasks.
Current models perform significantly worse than people when using vision data alone.
arXiv Detail & Related papers (2024-06-14T17:54:54Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1,200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- SVFAP: Self-supervised Video Facial Affect Perceiver [42.16505961654868]
This paper introduces a self-supervised approach, termed the Self-supervised Video Facial Affect Perceiver (SVFAP).
To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition.
Comprehensive results demonstrate that SVFAP can learn powerful affect-related representations via large-scale self-supervised pre-training and it significantly outperforms previous state-of-the-art methods on all datasets.
arXiv Detail & Related papers (2023-12-31T07:44:05Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z) - Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge [133.80567761430584]
We collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario.
OVIS consists of 296k high-quality instance masks and 901 occluded scenes.
All baseline methods encounter a significant performance degradation of about 80% in the heavily occluded object group.
arXiv Detail & Related papers (2021-11-15T17:59:03Z) - DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions
in the Wild [22.305429904593126]
We present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW, consisting of over 16,000 video clips from thousands of movies.
We also propose a novel Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild.
We conduct extensive benchmark experiments on DFEW using many deep feature learning methods as well as the proposed EC-STFL.
arXiv Detail & Related papers (2020-08-13T14:10:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.