Mitigating Semantic Collapse in Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2510.27432v1
- Date: Fri, 31 Oct 2025 12:39:20 GMT
- Title: Mitigating Semantic Collapse in Partially Relevant Video Retrieval
- Authors: WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo
- Abstract summary: Partially Relevant Video Retrieval seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces.
- Score: 41.715994314208025
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
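For concreteness, the following PyTorch-style sketch shows one plausible reading of the two objectives named in the abstract. It is illustrative only, not the authors' implementation; the function names, the KL-based relational matching, and the symmetric InfoNCE form are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def text_correlation_preservation_loss(student_txt, teacher_txt, tau=0.07):
    """Preserve the pairwise similarity structure of a frozen foundation-model
    text encoder (teacher) in the learned text embeddings (student).
    Both inputs: (N, D) embeddings for the same N queries."""
    s = F.normalize(student_txt, dim=-1) @ F.normalize(student_txt, dim=-1).T
    t = F.normalize(teacher_txt, dim=-1) @ F.normalize(teacher_txt, dim=-1).T
    # Match the two relational distributions row-wise (one common choice of
    # divergence; the paper may use a different one).
    return F.kl_div(F.log_softmax(s / tau, dim=-1),
                    F.softmax(t / tau, dim=-1), reduction="batchmean")

def cross_branch_alignment_loss(clip_emb, video_emb, tau=0.07):
    """Contrastive alignment between a fine-grained (clip-level) branch and a
    coarse (video-level) branch of the same videos: matched pairs attract,
    mismatched pairs repel. clip_emb, video_emb: (B, D)."""
    c = F.normalize(clip_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = c @ v.T / tau                                  # (B, B) similarities
    targets = torch.arange(c.size(0), device=c.device)      # i-th clip <-> i-th video
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In practice the two terms would be weighted and added to the standard text-video retrieval loss; the order-preserving token merging step mentioned above would decide which frame tokens form each clip before the alignment is applied.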
Related papers
- Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval [0.0]
Partially Relevant Video Retrieval (PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. We point out the inherent ambiguity between text and video content based on their conceptual scope. We propose a framework that incorporates this ambiguity into the model learning process.
arXiv Detail & Related papers (2025-06-09T06:44:45Z)
- VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models [48.00262713744499]
VideoComp is a benchmark and learning framework for advancing video-text compositionality understanding. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences.
arXiv Detail & Related papers (2025-04-04T22:24:30Z)
- Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval [90.72791786676753]
Video-ColBERT introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
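Video-ColBERT's fine-grained similarity assessment follows the ColBERT-style late-interaction idea; below is a minimal sketch of such a scoring function, with the shapes and the sum-of-max pooling as assumptions rather than the paper's exact design.

```python
import torch.nn.functional as F

def late_interaction_score(query_tokens, frame_feats):
    """Each query token is matched to its best-scoring visual feature and the
    per-token maxima are summed. query_tokens: (Lq, D), frame_feats: (Lv, D)."""
    q = F.normalize(query_tokens, dim=-1)
    v = F.normalize(frame_feats, dim=-1)
    sim = q @ v.T                       # (Lq, Lv) token-to-frame cosine similarities
    return sim.max(dim=-1).values.sum()
```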
arXiv Detail & Related papers (2025-03-24T17:51:29Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find, within a database, videos that contain pertinent moments.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
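One way to picture the implicit, Gaussian-constrained clip modeling is as multi-scale Gaussian-weighted aggregation over frames. The sketch below illustrates only that intuition; it is not GMMFormer's actual block design, and the window widths are arbitrary choices.

```python
import torch

def gaussian_clip_embeddings(frame_feats, sigmas=(1.0, 4.0, 16.0)):
    """Re-encode each frame as a Gaussian-weighted mean of its temporal
    neighbours at several widths, yielding implicit multi-scale clip
    representations. frame_feats: (T, D); returns (len(sigmas), T, D)."""
    T = frame_feats.size(0)
    pos = torch.arange(T, dtype=frame_feats.dtype)
    dist2 = (pos[:, None] - pos[None, :]) ** 2               # (T, T) squared temporal gaps
    outs = []
    for s in sigmas:
        w = torch.softmax(-dist2 / (2 * s ** 2), dim=-1)     # Gaussian attention weights
        outs.append(w @ frame_feats)                         # (T, D)
    return torch.stack(outs)
```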
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
- VADER: Video Alignment Differencing and Retrieval [70.88247176534426]
VADER matches and aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over chunked video content.
A space-time comparator module identifies regions of manipulation between content, remaining invariant to residual temporal misalignments and to artifacts arising from non-editorial changes of the content.
arXiv Detail & Related papers (2023-03-23T11:50:44Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
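A natural reading of multiple-prototype matching is that a video is summarised by K prototype vectors and a query is scored against the best-matching one; the snippet below is a hypothetical sketch of that idea, not the model's full text-adaptive mechanism.

```python
import torch.nn.functional as F

def prototype_similarity(text_emb, video_prototypes):
    """text_emb: (D,); video_prototypes: (K, D). Returns the best cosine
    similarity between the query and any prototype of the video."""
    t = F.normalize(text_emb, dim=-1)
    p = F.normalize(video_prototypes, dim=-1)
    return (p @ t).max()
```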
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with that of baselines adopting cross-modal interaction learning.
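The efficiency claim follows from the separate encoders: video embeddings can be computed once and cached, so answering a query reduces to a single matrix multiply over the corpus. A generic retrieval sketch (not ReLoCLNet's architecture) follows.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, video_bank, topk=5):
    """query_emb: (D,); video_bank: (N, D) precomputed video embeddings.
    Returns the top-k cosine-similarity scores and their video indices."""
    scores = F.normalize(video_bank, dim=-1) @ F.normalize(query_emb, dim=-1)
    return torch.topk(scores, k=topk)
```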
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.