Related papers: Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

URL: http://arxiv.org/abs/2503.19009v1
Date: Mon, 24 Mar 2025 17:51:29 GMT
Title: Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Authors: Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, Rama Chellappa
Abstract summary: Video-ColBERT introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
Score: 90.72791786676753
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
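The abstract refers to late interaction without spelling out the scoring rule. As a rough illustration only, the sketch below shows the generic ColBERT-style MaxSim score: each query token is matched to its most similar video token, and the per-token maxima are summed. It assumes L2-normalized token embeddings and cosine similarity; the function names and toy vectors are hypothetical, and this is not the authors' exact Video-ColBERT formulation (the spatial and temporal interaction, the expansion tokens, and the dual sigmoid loss are not modeled here).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, video_tokens):
    """Generic late-interaction score (assumed form, not taken from the paper).

    For every query token, take the maximum cosine similarity against all
    video tokens, then sum those maxima over the query tokens.
    """
    q = [l2_normalize(t) for t in query_tokens]
    v = [l2_normalize(t) for t in video_tokens]
    return sum(max(dot(qt, vt) for vt in v) for qt in q)


# Toy 2-D embeddings, made up for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
video_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # has a close match for each query token
video_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, video_a)
score_b = maxsim_score(query, video_b)
```

In this toy setup, `video_a` scores higher than `video_b` because every query token finds a well-aligned video token. Since video token embeddings can be computed offline in a bi-encoder arrangement, only this lightweight matching step runs at query time.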

Related papers

NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations. We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net). First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos. To improve generalization, we show that one model can be trained with multiple text styles. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided. We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video. Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment. Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description. Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ). A module, BridgeFormer, is trained to answer the "questions" constructed by the text features via resorting to the video features. In the form of questions and answers, the semantic associations between local video-text features can be properly established. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query. We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
