ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning
- URL: http://arxiv.org/abs/2108.10501v1
- Date: Tue, 24 Aug 2021 03:18:12 GMT
- Title: ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning
- Authors: Zhiwu Qing, Ziyuan Huang, Shiwei Zhang, Mingqian Tang, Changxin Gao,
Marcelo H. Ang Jr, Rong Jin, Nong Sang
- Abstract summary: We present a parametric cubic cropping operation, ParamCrop, for video contrastive learning.
ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data.
Visualizations show that the center distance and the IoU between two augmented views are adaptively controlled by ParamCrop.
- Score: 35.577788907544964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The central idea of contrastive learning is to discriminate between different
instances and force different views of the same instance to share the same
representation. To avoid trivial solutions, augmentation plays an important
role in generating different views, among which random cropping is shown to be
effective for the model to learn a strong and generalized representation.
The commonly used random crop operation keeps the difference between the two views
statistically consistent throughout the training process. In this work, we challenge
this convention by showing that adaptively controlling the disparity between the
two augmented views over the course of training enhances the quality of the
learned representation. Specifically, we present a parametric cubic cropping
operation, ParamCrop, for video contrastive learning, which automatically crops
a 3D cube from the video via differentiable 3D affine transformations.
ParamCrop is trained simultaneously with the video backbone using an
adversarial objective and learns an optimal cropping strategy from the data.
Visualizations show that the center distance and the IoU between the two
augmented views are adaptively controlled by ParamCrop, and that the learned change
in disparity over the course of training is beneficial to learning a strong
representation. Extensive ablation studies demonstrate the effectiveness of the
proposed ParamCrop on multiple contrastive learning frameworks and video
backbones. With ParamCrop, we improve the state-of-the-art performance on both
HMDB51 and UCF101 datasets.
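To make the cropping mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a differentiable 3D cubic crop with a gradient-reversal layer for the adversarial objective. This is not the authors' released code: the class names (ParametricCubicCrop, GradReverse), the use of a single global parameter vector instead of a small parameter-generating network, and the specific scale/translation ranges are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so the crop parameters are pushed to increase the contrastive loss
    while the backbone is pushed to decrease it."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()


class ParametricCubicCrop(nn.Module):
    """Samples a 3D cube (time, height, width) from a video clip with a
    differentiable affine transformation, so gradients from the contrastive
    objective can flow back into the crop parameters."""

    def __init__(self):
        super().__init__()
        # Six learnable numbers: per-axis scale and translation of the cube.
        # (The paper generates such parameters with a small network; a single
        # global vector is used here only to keep the sketch short.)
        self.crop_params = nn.Parameter(torch.zeros(6))

    def forward(self, video):
        # video: (N, C, T, H, W)
        n = video.size(0)
        p = torch.tanh(self.crop_params)
        scale = 0.75 + 0.25 * p[:3]   # per-axis scale, kept inside (0.5, 1.0)
        trans = 0.25 * p[3:]          # per-axis shift of the cube centre
        # Assemble N copies of a 3x4 affine matrix for F.affine_grid.
        theta = torch.zeros(n, 3, 4, device=video.device, dtype=video.dtype)
        theta[:, 0, 0] = scale[0]
        theta[:, 1, 1] = scale[1]
        theta[:, 2, 2] = scale[2]
        theta[:, 0, 3] = trans[0]
        theta[:, 1, 3] = trans[1]
        theta[:, 2, 3] = trans[2]
        grid = F.affine_grid(theta, list(video.shape), align_corners=False)
        cropped = F.grid_sample(video, grid, align_corners=False)
        # Gradient reversal makes the cropper an adversary of the backbone.
        return GradReverse.apply(cropped)
```

In a full pipeline, two such croppers would produce the two augmented views fed to a contrastive loss; because of the gradient reversal, the cropping parameters are updated to increase the disparity that the backbone must tolerate, which is the adversarial training described in the abstract.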
Related papers
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - DVANet: Disentangling View and Action Features for Multi-View Action
Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency [13.19476138523546]
Cross-video relations have barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z) - Self-Supervised Learning via multi-Transformation Classification for
Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We conduct experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)