MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
- URL: http://arxiv.org/abs/2007.09833v2
- Date: Thu, 13 Aug 2020 05:42:05 GMT
- Title: MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
- Authors: Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, and Wei-Shi Zheng
- Abstract summary: We propose to cast weakly supervised video highlight detection for a given specific event as learning a multiple instance ranking network (MINI-Net).
MINI-Net learns to enforce a higher highlight score for a positive bag, which contains highlight segments of a specific event, than for negative bags that are irrelevant.
- Score: 71.02649475990889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the weakly supervised video highlight detection problem for
learning to detect segments that are more attractive in training videos given
their video event label but without expensive supervision of manually
annotating highlight segments. While it avoids manually localizing highlight
segments, weakly supervised modeling is challenging, as a video in our daily
life could contain highlight segments with multiple event types, e.g., skiing
and surfing. In this work, we propose to cast weakly supervised video
highlight detection for a given specific event as learning a multiple instance
ranking network (MINI-Net). We consider each video as a bag of segments, and
therefore, the proposed MINI-Net learns to enforce a higher highlight score for
a positive bag that contains highlight segments of a specific event than for
negative bags that are irrelevant. In particular, we form a max-max ranking
loss to acquire a reliable relative comparison between the most likely positive
segment instance and the hardest negative segment instance. With this max-max
ranking loss, our MINI-Net effectively leverages all segment information to
acquire a more distinct video feature representation for localizing the
highlight segments of a specific event in a video. The extensive experimental
results on three challenging public benchmarks clearly validate the efficacy of
our multiple instance ranking approach for solving the problem.
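As a concrete illustration of the bag-level comparison described above, here is a minimal PyTorch sketch of a max-max ranking loss; the function and argument names (max_max_ranking_loss, pos_scores, neg_scores, margin) are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def max_max_ranking_loss(pos_scores: torch.Tensor,
                         neg_scores: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
    """Hinge ranking loss between the most likely positive segment
    and the hardest negative segment.

    pos_scores: (num_pos_segments,) highlight scores for the segments of a
        positive bag (a video relevant to the target event).
    neg_scores: (num_neg_segments,) highlight scores for the segments of a
        negative bag (a video irrelevant to the event).
    """
    # Bag-level scores: the maximum segment score in each bag. For the
    # negative bag this selects the hardest negative instance.
    pos_bag = pos_scores.max()
    neg_bag = neg_scores.max()
    # Enforce that the positive bag outscores the hardest negative
    # by at least `margin`.
    return F.relu(margin - pos_bag + neg_bag)

# Usage sketch: segment scores would come from a segment-level scoring head.
pos = torch.randn(32, requires_grad=True)  # 32 segments in a positive video
neg = torch.randn(40, requires_grad=True)  # 40 segments in a negative video
loss = max_max_ranking_loss(pos, neg, margin=1.0)
loss.backward()
```

Taking the maximum over both bags is what makes this a max-max ranking: the most likely positive segment is compared against the hardest negative segment, so gradients flow through the most informative pair.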
Related papers
- Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence [13.2968942989609]
We focus on unsupervised video highlight detection, eliminating the need for manual annotations.
Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video.
We also compute visual pseudo-highlight scores for each video using visual features.
arXiv Detail & Related papers (2024-07-18T23:09:14Z)
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce human annotation to a single point for each object in a video frame during training, and obtain high-quality mask predictions close to those of fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% AP^video_50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of AP^video.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but suffer from dependence on extra cumbersome models and poor performance at the segment assemblage stage.
We propose M-SAN, which performs efficient and coherent segment assemblage end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z)
- Reliable Shot Identification for Complex Event Detection via Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the resulting challenging non-convex optimization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z)
- Cross-category Video Highlight Detection via Set-based Learning [55.49267044910344]
We propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework.
It learns both the distinction of target-category videos and the characteristics of highlight moments from the source video category.
It outperforms five typical Unsupervised Domain Adaptation (UDA) algorithms on various cross-category highlight detection tasks.
arXiv Detail & Related papers (2021-08-26T13:06:47Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the dynamic information of the video but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
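For illustration only, a hypothetical sketch of the two-stream fusion and score regression described in the ACTION-NET entry above; the module, feature dimensions, and training details here are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamScoreRegressor(nn.Module):
    """Hypothetical two-stream fusion: dynamic (video) features and
    static (posture) features are concatenated and regressed to a
    scalar quality score supervised by expert ratings."""

    def __init__(self, dyn_dim: int = 1024, static_dim: int = 512,
                 hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(dyn_dim + static_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar action-quality score
        )

    def forward(self, dyn_feat: torch.Tensor,
                static_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two streams and regress the score.
        return self.fuse(torch.cat([dyn_feat, static_feat], dim=-1)).squeeze(-1)

# Training step against expert ground-truth scores (simple MSE regression).
model = TwoStreamScoreRegressor()
dyn = torch.randn(8, 1024)    # dynamic-stream features, one per video
static = torch.randn(8, 512)  # static posture features, one per video
target = torch.rand(8)        # placeholder expert-given scores
loss = F.mse_loss(model(dyn, static), target)
loss.backward()
```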
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.