Improving Video Retrieval by Adaptive Margin
- URL: http://arxiv.org/abs/2303.05093v1
- Date: Thu, 9 Mar 2023 08:07:38 GMT
- Title: Improving Video Retrieval by Adaptive Margin
- Authors: Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lv, Yong Zhu, Xiao Tan
- Abstract summary: The dominant paradigm for video retrieval learns video-text representations by pushing the similarity of positive pairs and that of negative pairs apart by a fixed margin.
Negative pairs used for training are sampled randomly, which means that the two items in a negative pair may be semantically related or even equivalent.
We propose an adaptive margin that changes with the distance between positive and negative pairs to address this issue.
- Score: 18.326296132847332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video retrieval is becoming increasingly important owing to the rapid
emergence of videos on the Internet. The dominant paradigm for video retrieval
learns video-text representations by pushing the similarity of positive pairs
and that of negative pairs apart by a fixed
margin. However, negative pairs used for training are sampled randomly, which
indicates that the semantics between negative pairs may be related or even
equivalent, while most methods still enforce dissimilar representations to
decrease their similarity. This phenomenon leads to inaccurate supervision and
poor performance in learning video-text representations.
While most video retrieval methods overlook this phenomenon, we propose an
adaptive margin that changes with the distance between positive and negative
pairs to address it. First, we design the calculation framework
of the adaptive margin, including the method of distance measurement and the
function between the distance and the margin. Then, we explore a novel
implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD),
which can be built on top of most video retrieval models with few
modifications. Notably, CMGSD adds little computational overhead at training
time and none at test time. Experimental results on three
widely used datasets demonstrate that the proposed method can yield
significantly better performance than the corresponding backbone model, and it
outperforms state-of-the-art methods by a large margin.
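To make the idea of a margin that varies per negative pair concrete, here is a minimal NumPy sketch of a bidirectional hinge ranking loss with an adaptive margin. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance measurement or the distance-to-margin function, so the cosine-based distance computed from a hypothetical reference embedding `ref_emb`, the linear mapping, and the names `adaptive_margin_loss` and `max_margin` are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def adaptive_margin_loss(text_emb, video_emb, ref_emb, max_margin=0.2):
    """Hinge ranking loss whose margin shrinks for semantically close negatives.

    text_emb, video_emb: (N, D) arrays; row i of each forms a positive pair.
    ref_emb: (N, K) array used only to measure how far sample i is from
             sample j (hypothetical; any semantic descriptor could be used).
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    r = l2_normalize(ref_emb)

    sim = t @ v.T          # sim[i, j]: similarity of text i and video j
    pos = np.diag(sim)     # similarities of the positive pairs

    # Assumed distance in [0, 1]: 0 for equivalent samples, 1 for unrelated ones.
    dist = np.clip(1.0 - r @ r.T, 0.0, 1.0)
    # Assumed linear distance-to-margin function.
    margin = max_margin * dist

    # Text-to-video: negative video j should score below text i's own video.
    t2v = np.maximum(0.0, margin + sim - pos[:, None])
    # Video-to-text: negative text i should score below video j's own text.
    v2t = np.maximum(0.0, margin + sim - pos[None, :])

    off_diag = ~np.eye(len(sim), dtype=bool)  # exclude the positive pairs
    return (t2v[off_diag].sum() + v2t[off_diag].sum()) / len(sim)
```

With a fixed margin, two randomly sampled items that happen to be equivalent are still pushed apart by the full margin. In this sketch their distance is 0, so the margin vanishes and the pair contributes no loss as long as the negative does not outscore the positive.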
Related papers
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark [33.86872697028233]
We present an in-depth study on few-shot video classification by making three contributions.
First, we perform a consistent comparative study on the existing metric-based methods to figure out their limitations in representation learning.
Second, we discover that there is a high correlation between the novel action class and the ImageNet object class, which is problematic in the few-shot recognition setting.
Third, we present a new benchmark with more base data to facilitate future few-shot video classification without pre-training.
arXiv Detail & Related papers (2021-10-24T06:01:46Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm (PRVG) by re-purposing a Transformer-like architecture.
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because video representation is important, we extend negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Multiple Instance-Based Video Anomaly Detection using Deep Temporal Encoding-Decoding [5.255783459833821]
We propose a weakly supervised deep temporal encoding-decoding solution for anomaly detection in surveillance videos.
The proposed approach uses both abnormal and normal video clips during the training phase.
The results show that the proposed method performs similarly to or better than the state-of-the-art solutions for anomaly detection in video surveillance applications.
arXiv Detail & Related papers (2020-07-03T08:22:42Z)