Unsupervised Contrastive Learning of Image Representations from
Ultrasound Videos with Hard Negative Mining
- URL: http://arxiv.org/abs/2207.13148v1
- Date: Tue, 26 Jul 2022 19:00:33 GMT
- Title: Unsupervised Contrastive Learning of Image Representations from
Ultrasound Videos with Hard Negative Mining
- Authors: Soumen Basu, Somanshu Singla, Mayank Gupta, Pratyaksha Rana, Pankaj
Gupta, Chetan Arora
- Abstract summary: State-of-the-art (SOTA) contrastive learning techniques consider frames within a video as positives in the embedding space.
We observe that unlike multiple views of an object in natural scene videos, an Ultrasound (US) video captures different 2D slices of an organ.
We propose instead to utilize such frames as hard negatives to learn rich image representations.
- Score: 16.49278694957565
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rich temporal information and variations in viewpoints make video data an
attractive choice for learning image representations using unsupervised
contrastive learning (UCL) techniques. State-of-the-art (SOTA) contrastive
learning techniques consider frames within a video as positives in the
embedding space, whereas the frames from other videos are considered negatives.
We observe that unlike multiple views of an object in natural scene videos, an
Ultrasound (US) video captures different 2D slices of an organ. Hence, there is
almost no similarity between the temporally distant frames of even the same US
video. In this paper, we propose instead to utilize such frames as hard
negatives. We advocate mining both intra-video and cross-video negatives in a
hardness-sensitive negative mining curriculum in a UCL framework to learn rich
image representations. We deploy our framework to learn the representations of
Gallbladder (GB) malignancy from US videos. We also construct the first
large-scale US video dataset containing 64 videos and 15,800 frames for
learning GB representations. We show that a standard ResNet50 backbone
trained with our framework improves accuracy on the GB malignancy detection
task by 2-6% over models pretrained with SOTA UCL techniques, as well as over
supervised models pretrained on ImageNet. We further validate the generalizability of
our method on a publicly available lung US image dataset of COVID-19
pathologies and show an improvement of 1.5% compared to SOTA. Source code,
dataset, and models are available at https://gbc-iitd.github.io/usucl.
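The core idea lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of an InfoNCE-style loss in which temporally distant frames of the same US video enter as intra-video hard negatives alongside cross-video negatives. The curriculum weight `beta` and the exact weighting scheme are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(anchor, positive, intra_negs, cross_negs,
                          beta=1.0, tau=0.1):
    """InfoNCE with intra-video hard negatives (illustrative sketch).

    anchor, positive: (D,) embeddings of temporally nearby frames.
    intra_negs: (Ni, D) temporally distant frames from the SAME video.
    cross_negs: (Nc, D) frames sampled from OTHER videos.
    beta: hypothetical curriculum weight, ramped from 0 toward 1 so that
          hard intra-video negatives are phased in as training proceeds.
    tau: softmax temperature.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    intra_negs = F.normalize(intra_negs, dim=-1)
    cross_negs = F.normalize(cross_negs, dim=-1)

    pos_logit = (anchor @ positive) / tau        # scalar similarity
    intra_logits = (intra_negs @ anchor) / tau   # (Ni,)
    cross_logits = (cross_negs @ anchor) / tau   # (Nc,)

    # Denominator mixes both negative pools; beta controls how strongly
    # the intra-video hard negatives contribute over the curriculum.
    denom = (pos_logit.exp()
             + beta * intra_logits.exp().sum()
             + cross_logits.exp().sum())
    return -(pos_logit - denom.log())
```

A hardness-sensitive curriculum in this sketch would simply schedule `beta` upward during training, so the encoder first learns coarse structure from easy cross-video negatives before confronting same-video hard negatives.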
Related papers
- Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection.
By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models.
V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving real-time inference speed (30 FPS), on par with image-based models.
arXiv Detail & Related papers (2024-08-26T07:17:05Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route for visual recognition tasks.
We present Efficient Video Learning (EVL), a framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency [13.19476138523546]
Cross-video relations have barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z) - Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video, while the other view has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
arXiv Detail & Related papers (2021-03-30T17:58:46Z) - Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
arXiv Detail & Related papers (2020-08-09T19:58:45Z) - Self-supervised Video Representation Learning Using Inter-intra
Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Noting that temporal information plays an important role in video representation, we expand the negative samples by introducing intra-negative samples (see the sketch after this list).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z) - Watching the World Go By: Representation Learning from Unlabeled Videos [78.22211989028585]
Recent single-image unsupervised representation learning techniques show remarkable success on a variety of tasks but rely on artificially created views of each image.
In this paper, we argue that videos offer such natural augmentation for free.
We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single image representations.
arXiv Detail & Related papers (2020-03-18T00:07:21Z)