Improving Video Retrieval by Adaptive Margin
- URL: http://arxiv.org/abs/2303.05093v1
- Date: Thu, 9 Mar 2023 08:07:38 GMT
- Title: Improving Video Retrieval by Adaptive Margin
- Authors: Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lv, Yong Zhu, Xiao Tan
- Abstract summary: The dominant paradigm for video retrieval learns video-text representations by pushing the similarity of positive pairs above that of negative pairs by a fixed margin.
Negative pairs used for training are sampled randomly, so the text and video in a negative pair may be semantically related or even equivalent.
We propose an adaptive margin that changes with the distance between positive and negative pairs to address this issue.
- Score: 18.326296132847332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video retrieval is becoming increasingly important owing to the rapid
growth of videos on the Internet. The dominant paradigm for video retrieval
learns video-text representations by pushing the similarity of positive pairs
above that of negative pairs by a fixed margin. However, negative pairs used
for training are sampled randomly, so the two items in a negative pair may be
semantically related or even equivalent, yet most methods still force their
representations apart to decrease their similarity. This leads to inaccurate
supervision and poor performance in learning video-text representations.
While most video retrieval methods overlook this phenomenon, we propose an
adaptive margin that changes with the distance between positive and negative
pairs to address the issue. First, we design the framework for computing the
adaptive margin, including how the distance is measured and how the margin is
derived from that distance. Then, we explore a novel implementation called
"Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top
of most video retrieval models with few modifications. Notably, CMGSD adds
little computational overhead at training time and none at test time.
Experimental results on three widely used datasets demonstrate that the
proposed method yields significantly better performance than the corresponding
backbone model and outperforms state-of-the-art methods by a large margin.
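The abstract describes the idea but gives no implementation. As a hedged illustration, the PyTorch-style sketch below contrasts the standard fixed-margin ranking loss with an adaptive-margin variant; the function names, the use of text-text similarity as the distance measure between positive and negative pairs, and the linear distance-to-margin mapping are assumptions made for this sketch, not the paper's exact CMGSD formulation.

```python
# Illustrative sketch only (PyTorch); not the authors' released code.
# video_emb and text_emb are assumed to be L2-normalized (B, D) embeddings,
# with matching rows forming the positive pairs.
import torch

def fixed_margin_loss(video_emb, text_emb, margin=0.2):
    """Dominant paradigm: hinge ranking loss with a fixed margin."""
    sim = video_emb @ text_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)               # positive-pair similarities
    cost = (margin + sim - pos).clamp(min=0)    # violation: neg > pos - margin
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost.masked_fill(diag, 0.0).mean()   # ignore positive pairs on the diagonal

def adaptive_margin_loss(video_emb, text_emb, base_margin=0.2):
    """Adaptive-margin variant: the margin shrinks for negatives that are
    semantically close to the positive, so related pairs are pushed apart
    less aggressively. Using text-text similarity as the distance proxy and
    a linear distance-to-margin mapping are assumptions of this sketch.
    """
    sim = video_emb @ text_emb.t()
    pos = sim.diag().unsqueeze(1)
    with torch.no_grad():
        relatedness = (text_emb @ text_emb.t() + 1) / 2   # map [-1, 1] -> [0, 1]
        margins = base_margin * (1.0 - relatedness)       # related pairs get a smaller margin
    cost = (margins + sim - pos).clamp(min=0)
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost.masked_fill(diag, 0.0).mean()
```

The paper's actual implementation, CMGSD, derives the adaptive margin through cross-modal generalized self-distillation rather than the fixed heuristic above; the sketch only shows the general shape of a margin that varies with the semantic distance between pairs.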
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark [33.86872697028233]
We present an in-depth study on few-shot video classification by making three contributions.
First, we perform a consistent comparative study on the existing metric-based methods to figure out their limitations in representation learning.
Second, we discover a high correlation between the novel action classes and ImageNet object classes, which is problematic in the few-shot recognition setting.
Third, we present a new benchmark with more base data to facilitate future few-shot video classification without pre-training.
arXiv Detail & Related papers (2021-10-24T06:01:46Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations and, in doing so, also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the set of negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Multiple Instance-Based Video Anomaly Detection using Deep Temporal Encoding-Decoding [5.255783459833821]
We propose a weakly supervised deep temporal encoding-decoding solution for anomaly detection in surveillance videos.
The proposed approach uses both abnormal and normal video clips during the training phase.
The results show that the proposed method performs similar to or better than the state-of-the-art solutions for anomaly detection in video surveillance applications.
arXiv Detail & Related papers (2020-07-03T08:22:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.