Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach
- URL: http://arxiv.org/abs/2106.11549v1
- Date: Tue, 22 Jun 2021 05:21:59 GMT
- Title: Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach
- Authors: Hyolim Kang, Jinwoo Kim, Kyungmin Kim, Taehyun Kim, Seon Joo Kim
- Abstract summary: We introduce a novel contrastive learning based approach to deal with the Generic Event Boundary Detection task.
In our model, the Temporal Self-similarity Matrix (TSM) is utilized as an intermediate representation that acts as an information bottleneck.
- Score: 27.904987752334314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic Event Boundary Detection (GEBD) is a newly introduced task that aims
to detect "general" event boundaries that correspond to natural human
perception. In this paper, we introduce a novel contrastive learning based
approach to deal with the GEBD task. Our intuition is that the feature similarity
of the video snippets would vary significantly near the event boundaries, while
remaining relatively stable over the rest of the video. In our model, the
Temporal Self-similarity Matrix (TSM) is utilized as an intermediate
representation that acts as an information bottleneck. With our model, we
achieved a significant performance boost compared to the given baselines. Our
code is available at
https://github.com/hello-jinwoo/LOVEU-CVPR2021.
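For intuition, here is a minimal sketch of how a Temporal Self-similarity Matrix can be computed from per-snippet features; the cosine similarity measure and the random toy features are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def temporal_self_similarity(features: torch.Tensor) -> torch.Tensor:
    """Build a Temporal Self-similarity Matrix (TSM).

    features: (T, D) tensor, one D-dimensional embedding per video snippet.
    Returns a (T, T) matrix whose (i, j) entry is the cosine similarity
    between snippets i and j. Near an event boundary, the block-diagonal
    pattern of this matrix changes sharply.
    """
    normed = F.normalize(features, dim=-1)  # unit-normalize each snippet
    return normed @ normed.t()              # all pairwise cosine similarities

# Toy usage: 32 snippets with 256-d features.
tsm = temporal_self_similarity(torch.randn(32, 256))
print(tsm.shape)  # torch.Size([32, 32])
```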
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
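As a rough sketch of the consistency idea (not the paper's implementation), the snippet below penalizes disagreement between per-frame predictions on a clip and on a temporally reversed copy after mapping the latter back; temporal reversal is a stand-in augmentation chosen here for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_loss(pred: torch.Tensor, pred_reversed: torch.Tensor) -> torch.Tensor:
    """pred: (T,) per-frame scores on the original clip.
    pred_reversed: (T,) scores on the temporally reversed clip.
    Flipping the second prediction back should reproduce the first."""
    return F.mse_loss(pred, torch.flip(pred_reversed, dims=[0]))
```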
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Generic Event Boundary Detection in Video with Pyramid Features [12.896848011230523]
Generic event boundary detection (GEBD) aims to split a video into chunks at a broad and diverse set of actions, following how humans naturally perceive event boundaries.
We present an approach that considers the correlation between neighbor frames with pyramid feature maps in both spatial and temporal dimensions.
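A hypothetical sketch of the core signal: cosine similarity between neighboring frames computed at several temporal scales via average pooling (the scales and pooling are assumptions; the paper's pyramid is built from spatial and temporal feature maps).

```python
import torch
import torch.nn.functional as F

def neighbor_similarity_pyramid(feats: torch.Tensor, scales=(1, 2, 4)):
    """feats: (T, D) per-frame features. For each temporal scale, average-pool
    the sequence and return the cosine similarity of each neighboring pair;
    dips in these curves hint at event boundaries."""
    curves = []
    for s in scales:
        pooled = F.avg_pool1d(feats.t().unsqueeze(0), s, stride=s)  # (1, D, T//s)
        normed = F.normalize(pooled.squeeze(0).t(), dim=-1)         # (T//s, D)
        curves.append((normed[:-1] * normed[1:]).sum(-1))           # (T//s - 1,)
    return curves
```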
arXiv Detail & Related papers (2023-01-11T03:29:27Z)
- UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection [27.29169136392871]
Generic Event Boundary Detection (GEBD) aims to find one level deeper semantic boundaries of events.
We propose a novel framework for unsupervised/supervised GEBD, using the Temporal Self-similarity Matrix (TSM) as the video representation.
Our framework can be applied to both unsupervised and supervised settings, achieving state-of-the-art performance in both by a huge margin.
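A simplified stand-in for a TSM-based boundary contrastive objective (the paper's actual loss operates on TSM patterns with a contrastive kernel; the adjacent-pair version below is an illustrative assumption).

```python
import torch
import torch.nn.functional as F

def boundary_contrastive_loss(feats: torch.Tensor, is_boundary: torch.Tensor,
                              margin: float = 0.5) -> torch.Tensor:
    """feats: (T, D) snippet features. is_boundary: (T-1,) bool, True where an
    event boundary lies between snippets t and t+1. Pulls within-event
    neighbors together and pushes cross-boundary neighbors apart.
    Assumes the clip contains both boundary and non-boundary transitions."""
    normed = F.normalize(feats, dim=-1)
    sim = (normed[:-1] * normed[1:]).sum(-1)         # cosine sim of adjacent pairs
    pull = (1.0 - sim[~is_boundary]).mean()          # same event: similarity -> 1
    push = F.relu(sim[is_boundary] - margin).mean()  # boundary: sim below margin
    return pull + push
```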
arXiv Detail & Related papers (2021-11-29T18:50:39Z)
- Instance-Level Relative Saliency Ranking with Graph Reasoning [126.09138829920627]
We present a novel unified model to segment salient instances and infer relative saliency rank order.
A novel loss function is also proposed to effectively train the saliency ranking branch.
Experimental results demonstrate that our proposed model is more effective than previous methods.
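The exact loss is not described in this summary; as an illustrative assumption, a pairwise hinge loss over instance saliency scores could look like this.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores: torch.Tensor, ranks: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """scores: (N,) predicted saliency scores; ranks: (N,) ground-truth ranks
    (0 = most salient). For every ordered pair, a more salient instance
    should score at least `margin` above a less salient one."""
    gaps = scores.unsqueeze(1) - scores.unsqueeze(0)        # gaps[i, j] = s_i - s_j
    more_salient = ranks.unsqueeze(1) < ranks.unsqueeze(0)  # i outranks j
    return F.relu(margin - gaps)[more_salient].mean()
```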
arXiv Detail & Related papers (2021-07-08T13:10:42Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
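CoCon's cooperative variant mines complementary information across views; the sketch below shows only the standard two-view InfoNCE objective that such methods build on (the batch construction and temperature are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (B, D) embeddings of two views of the same B clips.
    Matching rows are positives; all other rows serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # row i matches column i
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```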
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Generic Event Boundary Detection: A Benchmark for Event Segmentation [21.914662894860474]
This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.
We introduce the task of Generic Event Boundary Detection (GEBD) and the new benchmark Kinetics-GEBD.
Inspired by the cognitive finding that humans mark boundaries at points where they are unable to predict the future accurately, we explore unsupervised approaches.
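A toy illustration of the predictability intuition (not the benchmark's method): use "the next frame looks like the current one" as a crude predictor and treat large prediction errors as boundary evidence.

```python
import torch

def predictability_scores(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, D) per-frame features. Scores each frame transition by the
    error of the naive predictor f_hat[t+1] = f[t]; peaks in the normalized
    error curve suggest event boundaries."""
    err = (feats[1:] - feats[:-1]).norm(dim=-1)     # (T-1,) prediction errors
    return (err - err.mean()) / (err.std() + 1e-8)  # normalized boundary scores
```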
arXiv Detail & Related papers (2021-01-26T01:31:30Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels.
Our method leverages person detectors that have been trained on large image datasets, within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
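For reference, here is the standard Multiple Instance Learning pooling step such methods start from, in a minimal sketch (the paper's contribution is handling bags that violate the at-least-one-positive assumption, which is not shown here).

```python
import torch

def mil_video_logits(instance_logits: torch.Tensor) -> torch.Tensor:
    """instance_logits: (N, C) action logits for the N person detections in a
    video (the 'bag'). Classic MIL max-pooling: the bag's score for each class
    is its best instance score, so a video-level label can train the model."""
    return instance_logits.max(dim=0).values  # (C,) video-level logits
```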
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
- Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision [10.859792341257931]
We first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours.
We propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features.
Our method outperforms other state-of-the-art methods on our released dataset and other existing benchmarks.
arXiv Detail & Related papers (2020-07-09T10:29:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.