Ego-Vehicle Action Recognition based on Semi-Supervised Contrastive
Learning
- URL: http://arxiv.org/abs/2303.00977v1
- Date: Thu, 2 Mar 2023 05:19:31 GMT
- Title: Ego-Vehicle Action Recognition based on Semi-Supervised Contrastive
Learning
- Authors: Chihiro Noguchi, Toshihiro Tanizawa
- Abstract summary: We show that proper video-to-video distances can be defined by focusing on ego-vehicle actions.
Existing methods based on supervised learning cannot handle videos that do not fall into predefined classes.
We propose a method based on semi-supervised contrastive learning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, many automobiles have been equipped with cameras, which have
accumulated an enormous amount of video footage of driving scenes. Autonomous
driving demands the highest level of safety, for which even unimaginably rare
driving scenes have to be collected in training data to improve the recognition
accuracy for specific scenes. However, it is prohibitively costly to find such
a small number of specific scenes in an enormous amount of video. In this article, we show
that proper video-to-video distances can be defined by focusing on ego-vehicle
actions. It is well known that existing methods based on supervised learning
cannot handle videos that do not fall into predefined classes, though they work
well in defining video-to-video distances in the embedding space between
labeled videos. To tackle this problem, we propose a method based on
semi-supervised contrastive learning. We consider two related but distinct
contrastive learning approaches: standard graph contrastive learning and our proposed
SOIA-based contrastive learning. We observe that the latter approach can
provide more sensible video-to-video distances between unlabeled videos. Next,
the effectiveness of our method is quantified by evaluating classification
performance on ego-vehicle action recognition using the HDD dataset, which
shows that our method, which includes unlabeled data in training, significantly
outperforms existing methods trained on labeled data alone.
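As a rough illustration of the kind of semi-supervised contrastive objective the abstract describes, the sketch below implements a generic NT-Xent/SupCon-style loss over video embeddings, in which labeled videos sharing the same ego-vehicle action class act as positives and unlabeled videos contribute only through augmented views of themselves. This is a minimal sketch under those assumptions; it does not reproduce the paper's SOIA-based distance or graph construction, and all function and variable names (e.g. semi_supervised_contrastive_loss, z1, z2) are illustrative.

```python
# Minimal sketch of a semi-supervised contrastive objective for video
# embeddings (generic NT-Xent/SupCon style; NOT the paper's SOIA variant).
# Assumes each video has already been encoded into a fixed-size embedding.
import torch
import torch.nn.functional as F


def semi_supervised_contrastive_loss(z1, z2, labels, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of N videos.
    labels:  (N,) action-class ids; -1 marks unlabeled videos.
    Labeled videos with the same class are positives for each other;
    unlabeled videos only use their own second view as a positive."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(-1e9)                             # exclude self-pairs

    lbl = torch.cat([labels, labels], dim=0)             # (2N,)
    # Positive mask: same (known) class, or the other augmented view of the same video.
    same_class = (lbl.unsqueeze(0) == lbl.unsqueeze(1)) & (lbl.unsqueeze(0) >= 0)
    other_view = torch.zeros(2 * n, 2 * n, dtype=torch.bool, device=z.device)
    idx = torch.arange(n, device=z.device)
    other_view[idx, idx + n] = True
    other_view[idx + n, idx] = True
    pos = same_class | other_view
    pos.fill_diagonal_(False)

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    loss = -(log_prob * pos.float()).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```

In a typical use, labeled and unlabeled videos would be mixed in the same mini-batch and encoded by a shared backbone, so that unlabeled videos are pulled toward their own augmentations while labeled videos are additionally pulled toward same-class videos.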
Related papers
- Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
- Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z)
- Semi-TCL: Semi-Supervised Track Contrastive Representation Learning [40.31083437957288]
We design a new instance-to-track matching objective to learn appearance embedding.
It compares a candidate detection to the embedding of the tracks persisted in the tracker.
We implement this learning objective in a unified form following the spirit of contrastive loss.
arXiv Detail & Related papers (2021-07-06T05:23:30Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds (a rough sketch of this two-speed idea appears after this list).
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task to well model both motion and appearance features.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
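Several of the entries above build contrastive pairs by sampling the same unlabeled video at different playback speeds, as in the two-pathway temporal contrastive model referenced earlier. The sketch below shows one common way such a two-speed contrastive pairing is set up; it is a generic illustration, not the code of any paper listed here, and the helpers sample_clip, two_speed_contrastive_loss, and encoder are assumed names.

```python
# Generic sketch: build a contrastive pair from one unlabeled video sampled
# at two playback speeds (slow = stride 1, fast = stride 4). Illustrative only.
import torch
import torch.nn.functional as F


def sample_clip(frames, stride, clip_len=8):
    """frames: (T, C, H, W) decoded video. Returns (clip_len, C, H, W)."""
    idx = torch.arange(clip_len) * stride
    idx = idx.clamp(max=frames.size(0) - 1)
    return frames[idx]


def two_speed_contrastive_loss(encoder, videos, temperature=0.1):
    """videos: list of (T, C, H, W) tensors. The slow and fast clips of the
    same video form a positive pair; clips from other videos are negatives.
    `encoder` is any video backbone mapping (N, clip_len, C, H, W) -> (N, D)."""
    slow = torch.stack([sample_clip(v, stride=1) for v in videos])
    fast = torch.stack([sample_clip(v, stride=4) for v in videos])
    z_slow = F.normalize(encoder(slow), dim=1)   # (N, D)
    z_fast = F.normalize(encoder(fast), dim=1)   # (N, D)
    logits = z_slow @ z_fast.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(videos), device=logits.device)
    # Symmetric InfoNCE: slow->fast and fast->slow.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```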