Supervised Video Summarization via Multiple Feature Sets with Parallel Attention
- URL: http://arxiv.org/abs/2104.11530v1
- Date: Fri, 23 Apr 2021 10:46:35 GMT
- Title: Supervised Video Summarization via Multiple Feature Sets with Parallel Attention
- Authors: Junaid Ahmed Ghauri, Sherzod Hakimov, Ralph Ewerth
- Abstract summary: We suggest a novel model architecture that combines three feature sets for visual content and motion to predict importance scores.
The proposed architecture utilizes an attention mechanism before fusing motion features and features representing the (static) visual content.
Comprehensive experimental evaluations are reported for two well-known datasets, SumMe and TVSum.
- Score: 4.931399476945033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The assignment of importance scores to particular frames or (short) segments
in a video is crucial for summarization, but also a difficult task. Previous
work utilizes only one source of visual features. In this paper, we suggest a
novel model architecture that combines three feature sets for visual content
and motion to predict importance scores. The proposed architecture utilizes an
attention mechanism before fusing motion features and features representing the
(static) visual content, i.e., derived from an image classification model.
Comprehensive experimental evaluations are reported for two well-known
datasets, SumMe and TVSum. In this context, we identify methodological issues
on how previous work used these benchmark datasets, and present a fair
evaluation scheme with appropriate data splits that can be used in future work.
When using static and motion features with a parallel attention mechanism, we
improve state-of-the-art results for SumMe, while being on par with the state
of the art for TVSum.
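To make the architecture concrete, here is a minimal PyTorch sketch of the parallel-attention idea: each feature set passes through its own self-attention branch, and the branches are fused afterwards to regress per-frame importance scores. This is an illustrative sketch, not the authors' implementation; the hidden sizes, head counts, concatenation fusion, and sigmoid scoring head are all assumptions.

```python
import torch
import torch.nn as nn

class ParallelAttentionSummarizer(nn.Module):
    """Minimal sketch: one self-attention branch per input feature set,
    late fusion by concatenation, and a per-frame importance score.
    Dimensions, head counts, and the fusion operator are assumptions."""

    def __init__(self, feature_dims, hidden_dim=256, num_heads=4):
        super().__init__()
        self.projections = nn.ModuleList(
            nn.Linear(d, hidden_dim) for d in feature_dims)
        self.attentions = nn.ModuleList(
            nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            for _ in feature_dims)
        self.head = nn.Sequential(
            nn.Linear(len(feature_dims) * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid())

    def forward(self, feature_sets):
        # feature_sets: list of (batch, frames, dim_i) tensors, one per source
        branches = []
        for x, proj, attn in zip(feature_sets, self.projections, self.attentions):
            h = proj(x)
            h, _ = attn(h, h, h)  # attend within this branch only ("parallel")
            branches.append(h)
        fused = torch.cat(branches, dim=-1)  # fuse only after attention
        return self.head(fused).squeeze(-1)  # (batch, frames) importance scores

# Toy example with three feature sets, e.g., one static (image-classification)
# set and two motion sets; the concrete extractors and dimensions are assumptions.
model = ParallelAttentionSummarizer([1024, 1024, 1024])
feats = [torch.randn(2, 120, 1024) for _ in range(3)]
print(model(feats).shape)  # torch.Size([2, 120])
```

Keeping attention per branch before fusion, as the abstract describes, lets each feature source weight its own temporally relevant frames before the sources are combined.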
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only for the source dataset, not for the target dataset, during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks (see the sketch after this entry).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
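The segment-permutation pretext task lends itself to a compact sketch: shuffle the temporal segments of an unlabeled skeleton sequence and train a classifier to recognize which permutation was applied. The tensor layout, segment count, and uniform permutation sampling below are assumptions for illustration, not taken from the paper.

```python
import itertools
import torch

def permute_segments(sequence, num_segments=3):
    """Self-supervised pretext task sketch: cut a skeleton sequence
    (frames, joints, channels) into temporal segments, shuffle them,
    and return the permuted sequence plus the permutation's class id.
    A classifier trained to recover the id must learn temporal structure."""
    perms = list(itertools.permutations(range(num_segments)))
    label = torch.randint(len(perms), (1,)).item()
    segments = torch.chunk(sequence, num_segments, dim=0)  # split along time
    shuffled = torch.cat([segments[i] for i in perms[label]], dim=0)
    return shuffled, label

seq = torch.randn(90, 25, 3)  # 90 frames, 25 joints, (x, y, z) coordinates
x, y = permute_segments(seq)
print(x.shape, y)  # torch.Size([90, 25, 3]) and a class id in [0, 5]
```

The analogous body-part task would permute groups of joints instead of temporal segments.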
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions (a graph-convolution sketch follows this entry).
We observe a 4-12% improvement over state-of-the-art methods in the mean average precision of matching human-annotated highlights.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
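Spatial-temporal graph convolutions of the kind mentioned above operate on a skeleton graph: a spatial step mixes features across connected joints, and a temporal step convolves along frames. Below is a generic single-block sketch, not HighlightMe's network; the adjacency normalization, kernel sizes, and toy chain skeleton are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTemporalGraphConv(nn.Module):
    """Sketch of one ST-GCN-style block: a spatial graph convolution over
    skeleton joints (via a normalized adjacency matrix) followed by a
    temporal convolution over frames. Graph and kernel sizes are
    illustrative assumptions, not the paper's exact design."""

    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        # Normalize adjacency with self-loops: A_hat = D^-1/2 (A + I) D^-1/2
        a = adjacency + torch.eye(adjacency.size(0))
        d = a.sum(dim=1).rsqrt().diag()
        self.register_buffer("a_hat", d @ a @ d)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)                               # per-joint transform
        x = torch.einsum("bctj,jk->bctk", x, self.a_hat)  # mix neighboring joints
        return self.temporal(x)                           # convolve along time

adj = torch.zeros(25, 25)  # toy skeleton graph: a chain of 25 joints
idx = torch.arange(24)
adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
block = SpatialTemporalGraphConv(3, 64, adj)
out = block(torch.randn(2, 3, 90, 25))
print(out.shape)  # torch.Size([2, 64, 90, 25])
```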
- Unsupervised Video Summarization via Multi-source Features [4.387757291346397]
Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video.
We propose the incorporation of multiple feature sources with chunk and stride fusion to provide more information about the visual content.
For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches.
arXiv Detail & Related papers (2021-05-26T13:12:46Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations (a minimal message-passing sketch follows this entry).
We show how our method more effectively models relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
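The message-passing pattern behind such graph neural networks is easy to sketch: each entity (node) receives messages from its neighbors, aggregates them, and updates its state. The sketch below is a generic plain-PyTorch illustration, not the paper's model; the mean aggregation and GRU update are assumptions.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """Minimal message-passing sketch: each node aggregates messages from
    its incoming edges (mean) and updates its state with a GRU cell.
    The aggregation scheme and layer sizes are illustrative assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # message from (src, dst) pair
        self.update = nn.GRUCell(dim, dim)      # node state update

    def forward(self, nodes, edges):
        # nodes: (num_nodes, dim); edges: (num_edges, 2) index pairs (src, dst)
        src, dst = edges[:, 0], edges[:, 1]
        msgs = self.message(torch.cat([nodes[src], nodes[dst]], dim=-1))
        # Mean-aggregate messages per destination node.
        agg = torch.zeros_like(nodes).index_add_(0, dst, msgs)
        count = torch.zeros(nodes.size(0), 1).index_add_(
            0, dst, torch.ones(len(dst), 1)).clamp(min=1)
        return self.update(agg / count, nodes)

layer = MessagePassingLayer(dim=32)
nodes = torch.randn(5, 32)                      # e.g., detected scene entities
edges = torch.tensor([[0, 1], [1, 2], [2, 0], [3, 4]])
print(layer(nodes, edges).shape)                # torch.Size([5, 32])
```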
This list is automatically generated from the titles and abstracts of the papers on this site.