Use of Affective Visual Information for Summarization of Human-Centric
Videos
- URL: http://arxiv.org/abs/2107.03783v1
- Date: Thu, 8 Jul 2021 11:46:04 GMT
- Title: Use of Affective Visual Information for Summarization of Human-Centric
Videos
- Authors: Berkay Köprü, Engin Erzin
- Abstract summary: We investigate the affective-information enriched supervised video summarization task for human-centric videos.
First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes.
- Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM).
- Score: 13.273989782771556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing volume of user-generated human-centric video content and its
applications, such as video retrieval and browsing, require compact
representations, which the video summarization literature addresses.
Current supervised studies formulate video summarization as a
sequence-to-sequence learning problem, and existing solutions often neglect
the surge of human-centric videos, which inherently contain affective content.
In this study, we investigate the affective-information enriched supervised
video summarization task for human-centric videos. First, we train a visual
input-driven state-of-the-art continuous emotion recognition model (CER-NET) on
the RECOLA dataset to estimate emotional attributes. Then, we integrate the
estimated emotional attributes and the high-level representations from the
CER-NET with the visual information to define the proposed affective video
summarization architectures (AVSUM). In addition, we investigate the use of
attention to improve the AVSUM architectures and propose two new architectures
based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We
conduct video summarization experiments on the TvSum database. The proposed
AVSUM-GRU architecture, with an early fusion of high-level GRU embeddings, and the
temporal attention-based TA-AVSUM architecture attain competitive video
summarization performance, bringing strong improvements on human-centric
videos over the state-of-the-art in terms of F-score and self-defined face
recall metrics.
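As a rough sketch of the early-fusion idea in the abstract, the snippet below concatenates per-frame visual features with CER-NET GRU embeddings and the estimated emotional attributes, then predicts frame-importance scores with a bidirectional GRU and a simple temporal attention layer (a TA-AVSUM-like variant). This is not the authors' implementation: the feature dimensions, layer sizes, and the exact attention formulation are assumptions.

```python
# Minimal sketch (assumed shapes and layers, not the paper's code) of early fusion
# of visual features, CER-NET GRU embeddings, and estimated emotional attributes,
# followed by a recurrent frame-importance scorer with temporal attention.
import torch
import torch.nn as nn


class AVSumGRUSketch(nn.Module):
    def __init__(self, d_visual=1024, d_cer=128, d_attr=2, d_hidden=256):
        super().__init__()
        d_fused = d_visual + d_cer + d_attr  # early fusion by concatenation
        self.summarizer = nn.GRU(d_fused, d_hidden, batch_first=True,
                                 bidirectional=True)
        # additive temporal attention over frames (TA-AVSUM-like re-weighting)
        self.attn = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden),
                                  nn.Tanh(),
                                  nn.Linear(d_hidden, 1))
        self.scorer = nn.Linear(2 * d_hidden, 1)

    def forward(self, visual_feats, cer_embeddings, emo_attrs):
        # visual_feats:   (B, T, d_visual)  frame-level visual features
        # cer_embeddings: (B, T, d_cer)     high-level CER-NET representations
        # emo_attrs:      (B, T, d_attr)    estimated attributes, e.g. arousal/valence
        fused = torch.cat([visual_feats, cer_embeddings, emo_attrs], dim=-1)
        h, _ = self.summarizer(fused)                     # (B, T, 2*d_hidden)
        w = torch.softmax(self.attn(h), dim=1)            # (B, T, 1) temporal weights
        h = h * w                                         # re-weight frame states
        return torch.sigmoid(self.scorer(h)).squeeze(-1)  # (B, T) frame importance


# Dummy usage: 2 videos, 300 frames each.
model = AVSumGRUSketch()
scores = model(torch.randn(2, 300, 1024),   # visual features
               torch.randn(2, 300, 128),    # CER-NET embeddings
               torch.randn(2, 300, 2))      # arousal/valence estimates
print(scores.shape)  # torch.Size([2, 300])
```

A keyshot summary would then be obtained by selecting high-scoring segments under a length budget, as is standard in supervised summarization pipelines.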
Related papers
- Enhancing Video Summarization with Context Awareness [9.861215740353247]
Video summarization automatically generates concise summaries by selecting keyframes, shots, or segments that capture the video's essence.
Despite the importance of video summarization, there is a lack of diverse and representative datasets.
We propose an unsupervised approach that leverages video data structure and information for generating informative summaries.
arXiv Detail & Related papers (2024-04-06T09:08:34Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Affective Video Content Analysis: Decade Review and New Perspectives [4.3569033781023165]
Affective video content analysis (AVCA), an essential branch of affective computing, has become a widely researched topic.
We introduce the widely used emotion representation models in AVCA and describe commonly used datasets.
We discuss future challenges and promising research directions, such as emotion recognition and public opinion analysis.
arXiv Detail & Related papers (2023-10-26T07:56:17Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
In this weakly supervised setting, a system does not know which human-object interactions are present in a video or the actual locations of the human and the object.
We introduce a dataset comprising over 6.5k videos with human-object interaction that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
- ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z)
- Efficient Video Summarization Framework using EEG and Eye-tracking Signals [0.92246583941469]
This paper proposes an efficient video summarization framework that will give a gist of the entire video in a few key-frames or video skims.
To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology.
Using our approach, a video is summarized by 96.5% while maintaining high precision and recall.
arXiv Detail & Related papers (2021-01-27T08:13:19Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn not only the dynamic video information but also the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of a successful external language model (ELM) and integrate its abundant linguistic knowledge into the caption model (an illustrative sketch of this kind of soft-target guidance follows after this list).
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
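The teacher-recommended learning entry above describes transferring linguistic knowledge from an external language model (ELM) into a caption model. The sketch below shows one generic way such soft-target guidance can be combined with the usual teacher-forced cross-entropy; the loss weighting, temperature, and shared-vocabulary setup are illustrative assumptions, not details taken from the paper.

```python
# Illustrative soft-target guidance in the spirit of teacher-recommended learning:
# the caption model is trained on ground-truth tokens (cross-entropy) and is also
# pulled toward the word distributions proposed by an external language model.
import torch
import torch.nn.functional as F


def trl_style_loss(caption_logits, gt_tokens, elm_logits, alpha=0.5, tau=2.0):
    # caption_logits: (B, T, V) caption-model outputs
    # gt_tokens:      (B, T)    ground-truth token ids
    # elm_logits:     (B, T, V) teacher (ELM) outputs over the same vocabulary
    vocab = caption_logits.size(-1)
    # standard teacher-forced cross-entropy against the ground-truth caption
    ce = F.cross_entropy(caption_logits.reshape(-1, vocab), gt_tokens.reshape(-1))
    # KL divergence between temperature-softened teacher and student distributions
    student = F.log_softmax(caption_logits / tau, dim=-1)
    teacher = F.softmax(elm_logits / tau, dim=-1)
    kd = F.kl_div(student, teacher, reduction="batchmean") * tau * tau
    return (1 - alpha) * ce + alpha * kd


# Dummy usage: batch of 2 captions, length 12, vocabulary of 1000 words.
loss = trl_style_loss(torch.randn(2, 12, 1000),
                      torch.randint(0, 1000, (2, 12)),
                      torch.randn(2, 12, 1000))
print(loss.item())
```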