Related papers: TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

URL: http://arxiv.org/abs/2409.19865v1
Date: Mon, 30 Sep 2024 01:49:37 GMT
Title: TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
Authors: Bingqing Zhang, Zhuo Cao, Heming Du, Xin Yu, Xue Li, Jiajun Liu, Sen Wang,
Abstract summary: TokenBinder is inspired by Comparative Judgement in human cognitive science. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism. Our experiments confirm that TokenBinder substantially outperforms existing state-of-the-art methods.
Score: 26.706441126814934
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks.

Related papers

Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding.<n>Due to the lack of semantics, heterogeneous representations may lead to erroneous matches.<n>We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
Improving Contextual ASR via Multi-grained Fusion with Large Language Models [12.755830619473368]
We propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs)<n>Our approach incorporates a late-fusion strategy that combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding.<n> Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics.
arXiv Detail & Related papers (2025-07-16T13:59:32Z)
Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality. We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities. We propose an Uncertainty-language Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval require models to understand information from different channels. contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text. There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR) We study the interaction paradigm in depth, where we find that its computation can be split into two terms. We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query. We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. Experimental results show that ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization [21.904563910555368]
We propose a novel learning framework to construct a set of discriminative data domains within each image-text pairs. Our approach can generally improve the learning efficiency and the performance of existing metrics learning frameworks.
arXiv Detail & Related papers (2020-10-23T01:48:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.