Learning Trailer Moments in Full-Length Movies
- URL: http://arxiv.org/abs/2008.08502v1
- Date: Wed, 19 Aug 2020 15:23:25 GMT
- Title: Learning Trailer Moments in Full-Length Movies
- Authors: Lezi Wang, Dong Liu, Rohit Puri, and Dimitris N. Metaxas
- Abstract summary: We leverage the officially-released trailers as weak supervision to learn a model that can detect key moments in full-length movies.
We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs.
We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over the supervised approach.
- Score: 49.74693903050302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A movie's key moments stand out from the screenplay to grab an
audience's attention and make movie browsing efficient. However, the lack of
annotations makes existing approaches inapplicable to movie key-moment
detection. To remove the need for human annotations, we leverage the
officially-released trailers as weak supervision to learn a model that
detects key moments in full-length movies. We introduce a novel ranking
network that uses the Co-Attention between movies and trailers as guidance
to generate training pairs, where moments highly correlated with the trailer
are expected to score higher than uncorrelated moments. Additionally, we
propose a Contrastive Attention module that enhances the feature
representations so that the contrast between features of key and non-key
moments is maximized. We construct the first movie-trailer dataset, and the
proposed Co-Attention assisted ranking network shows superior performance
even over a supervised approach. The effectiveness of our Contrastive
Attention module is also demonstrated by performance improvements over the
state of the art on public benchmarks.
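To make the weakly-supervised ranking idea concrete, here is a minimal PyTorch-style sketch (not the authors' released code): each movie moment is scored by its strongest cosine similarity to any trailer moment as a stand-in for co-attention, and a margin ranking loss trains a scorer to rank trailer-correlated moments above uncorrelated ones. The feature dimension, number of pseudo pairs, and margin are illustrative assumptions.

```python
# Minimal sketch of trailer-supervised moment ranking (illustrative, not the paper's code).
# Assumes pre-extracted clip features for a movie and its trailer; the co-attention
# scoring, margin, and feature sizes below are assumptions for demonstration.
import torch
import torch.nn.functional as F


def co_attention_scores(movie_feats, trailer_feats):
    """Score each movie moment by its strongest similarity to any trailer moment.

    movie_feats:   (N, D) features of N movie moments
    trailer_feats: (M, D) features of M trailer moments
    """
    movie = F.normalize(movie_feats, dim=-1)
    trailer = F.normalize(trailer_feats, dim=-1)
    attn = movie @ trailer.t()           # (N, M) cosine similarities
    return attn.max(dim=1).values        # (N,) per-moment correlation with the trailer


def ranking_loss(scorer, movie_feats, trailer_feats, margin=0.2, k=8):
    """Hinge ranking loss: trailer-correlated moments should score higher
    than uncorrelated ones under the learned scorer."""
    corr = co_attention_scores(movie_feats, trailer_feats)
    pos_idx = corr.topk(k=k).indices                # pseudo key moments
    neg_idx = (-corr).topk(k=k).indices             # pseudo non-key moments
    pos_scores = scorer(movie_feats[pos_idx]).squeeze(-1)
    neg_scores = scorer(movie_feats[neg_idx]).squeeze(-1)
    # Every positive should beat every negative by at least the margin.
    diff = margin - (pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0))
    return F.relu(diff).mean()


if __name__ == "__main__":
    scorer = torch.nn.Sequential(torch.nn.Linear(512, 128), torch.nn.ReLU(),
                                 torch.nn.Linear(128, 1))
    movie = torch.randn(200, 512)       # e.g. 200 movie shots, 512-d features
    trailer = torch.randn(30, 512)      # e.g. 30 trailer shots
    loss = ranking_loss(scorer, movie, trailer)
    loss.backward()
    print(float(loss))
```

At inference time only the trained scorer is needed: it ranks every moment of an unseen movie without requiring its trailer.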
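The Contrastive Attention module is only described at a high level in the abstract; the hedged sketch below captures just the stated objective, maximizing the contrast between key and non-key moment features, via a simple cosine pull/push loss. The margin and the way pseudo key/non-key moments are selected are assumptions, not the paper's actual attention design.

```python
# Hedged sketch of the objective behind "maximizing contrast" between key and
# non-key moment features (the paper's attention architecture is not reproduced):
# key-moment features are pulled together and pushed away from non-key features.
import torch
import torch.nn.functional as F


def contrastive_separation_loss(key_feats, nonkey_feats, margin=0.5):
    """key_feats: (P, D) pseudo key-moment features; nonkey_feats: (Q, D)."""
    key = F.normalize(key_feats, dim=-1)
    nonkey = F.normalize(nonkey_feats, dim=-1)
    intra = key @ key.t()                                       # (P, P)
    off_diag = intra[~torch.eye(intra.size(0), dtype=torch.bool)]
    inter = key @ nonkey.t()                                    # (P, Q)
    pull = 1.0 - off_diag.mean()                                # key moments attract
    push = F.relu(inter - (1.0 - margin)).mean()                # key vs. non-key repel
    return pull + push


if __name__ == "__main__":
    loss = contrastive_separation_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(float(loss))
```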
Related papers
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, with faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances using a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
- Film Trailer Generation via Task Decomposition [65.16768855902268]
We model movies as graphs, where nodes are shots and edges denote semantic relations between them.
We learn these relations using joint contrastive training which leverages privileged textual information from screenplays.
An unsupervised algorithm then traverses the graph and generates trailers that human judges prefer to ones generated by competitive supervised approaches.
arXiv Detail & Related papers (2021-11-16T20:50:52Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.