Use of Affective Visual Information for Summarization of Human-Centric Videos
- URL: http://arxiv.org/abs/2107.03783v1
- Date: Thu, 8 Jul 2021 11:46:04 GMT
- Title: Use of Affective Visual Information for Summarization of Human-Centric Videos
- Authors: Berkay Köprü, Engin Erzin
- Abstract summary: We investigate the affective-information enriched supervised video summarization task for human-centric videos.
First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes.
Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM).
- Score: 13.273989782771556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Increasing volume of user-generated human-centric video content and their
applications, such as video retrieval and browsing, require compact
representations that are addressed by the video summarization literature.
Current supervised studies formulate video summarization as a
sequence-to-sequence learning problem and the existing solutions often neglect
the surge of human-centric view, which inherently contains affective content.
In this study, we investigate the affective-information enriched supervised
video summarization task for human-centric videos. First, we train a visual
input-driven state-of-the-art continuous emotion recognition model (CER-NET) on
the RECOLA dataset to estimate emotional attributes. Then, we integrate the
estimated emotional attributes and the high-level representations from the
CER-NET with the visual information to define the proposed affective video
summarization architectures (AVSUM). In addition, we investigate the use of
attention to improve the AVSUM architectures and propose two new architectures
based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We
conduct video summarization experiments on the TvSum database. The proposed
AVSUM-GRU architecture with an early fusion of high level GRU embeddings and
the temporal attention based TA-AVSUM architecture attain competitive video
summarization performances by bringing strong performance improvements for the
human-centric videos compared to the state-of-the-art in terms of F-score and
self-defined face recall metrics.
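
The abstract reports results as an F-score but does not define it here. As a rough illustration only, the sketch below uses a formulation that is common in the video summarization literature: the harmonic mean of precision and recall over the frames shared by a predicted summary and a reference summary. The function name `summary_f_score`, the binary per-frame mask representation, and the toy data are assumptions for illustration; the paper's exact evaluation protocol (for example, how multiple annotators are aggregated) and its self-defined face recall metric are not described in this abstract and are not reproduced.

```python
def summary_f_score(predicted, reference):
    """F-score between two binary per-frame summary masks of equal length.

    Assumption: a summary is represented as a list with 1 for frames kept
    in the summary and 0 for frames left out.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must cover the same number of frames")

    # Frames selected by both the predicted and the reference summary.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No shared frames (or an empty summary) gives a score of zero.
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0

    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


# Toy 10-frame video (made-up data): 1 marks a frame kept in the summary.
predicted = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
reference = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0]
score = summary_f_score(predicted, reference)
print(round(score, 3))
```

In this toy case three of the four selected frames coincide, so precision and recall are both 0.75 and the F-score is 0.75.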
 
      
        Related papers
        - TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness [9.374702244811303]
 We introduce a self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers.
Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency.
 arXiv  Detail & Related papers  (2025-06-25T16:27:38Z)
- Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos [106.5804660736763]
 Video information retrieval remains a fundamental approach for accessing video content.
We build on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks.
We investigate whether similar biases emerge in the context of challenging video retrieval.
 arXiv  Detail & Related papers  (2025-02-11T07:43:47Z)
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
 A new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for three tasks.
This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks.
 arXiv  Detail & Related papers  (2024-12-18T06:43:06Z)
- Enhancing Video Summarization with Context Awareness [9.861215740353247]
 Video summarization automatically generates concise summaries by selecting techniques, shots, or segments that capture the video's essence.
Despite the importance of video summarization, there is a lack of diverse and representative datasets.
We propose an unsupervised approach that leverages video data structure and information for generating informative summaries.
 arXiv  Detail & Related papers  (2024-04-06T09:08:34Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
 The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
 arXiv  Detail & Related papers  (2023-11-20T20:24:45Z)
- Affective Video Content Analysis: Decade Review and New Perspectives [4.3569033781023165]
 Affective video content analysis (AVCA), an essential branch of affective computing, has become a widely researched topic.
We introduce the widely used emotion representation models in AVCA and describe commonly used datasets.
We discuss future challenges and promising research directions, such as emotion recognition and public opinion analysis.
 arXiv  Detail & Related papers  (2023-10-26T07:56:17Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
 We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
 arXiv  Detail & Related papers  (2022-10-18T17:58:25Z)
- Video Summarization Based on Video-text Modelling [0.0]
 We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
 arXiv  Detail & Related papers  (2022-01-07T15:21:46Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
 A system does not know what human-object interactions are present in a video or the actual location of the human and object.
We introduce a dataset comprising over 6.5k videos with human-object interaction that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
 arXiv  Detail & Related papers  (2021-10-07T15:30:18Z)
- ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
 We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
 arXiv  Detail & Related papers  (2021-07-24T15:14:20Z)
- Efficient Video Summarization Framework using EEG and Eye-tracking Signals [0.92246583941469]
 This paper proposes an efficient video summarization framework that will give a gist of the entire video in a few key-frames or video skims.
To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology.
Using our approach, a video is summarized by 96.5% while maintaining higher precision and high recall factors.
 arXiv  Detail & Related papers  (2021-01-27T08:13:19Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
 We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
 arXiv  Detail & Related papers  (2020-08-13T15:51:42Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
 We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
 arXiv  Detail & Related papers  (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     