Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2504.19637v1
- Date: Mon, 28 Apr 2025 09:52:46 GMT
- Title: Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval
- Authors: Junlong Ren, Gangjian Zhang, Yu Hu, Jian Shu, Hao Wang
- Abstract summary: PRVR aims to retrieve the target video that is partially relevant to a text query. Existing methods coarsely align paired videos and text queries to construct the semantic space. We propose a novel PRVR framework to systematically exploit inter-sample correlation and intra-sample redundancy.
- Score: 5.849812241074385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.
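The abstract describes the three modules only at a high level. Purely as a hypothetical sketch (PyTorch), the snippet below shows one way ICE-style pseudo-positive masking, IRM-style hard negatives, and a TCP-style order-prediction loss could enter a standard contrastive objective; every function name, tensor shape, and threshold here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: the paper does not provide code here, so every
# name, shape, and threshold below is an illustrative assumption.
import torch
import torch.nn.functional as F

def prvr_contrastive_loss(text, moments, tau=0.07, pseudo_thr=0.8, num_redundant=2):
    """text: (B, D) query embeddings; moments: (B, M, D) moment embeddings per video."""
    t = F.normalize(text, dim=-1)
    m = F.normalize(moments, dim=-1)

    # Paired "key" moment per query: the most similar moment within its own video.
    intra_sim = torch.einsum('bd,bmd->bm', t, m)                 # (B, M)
    key = m[torch.arange(t.size(0)), intra_sim.argmax(dim=1)]    # (B, D)

    # ICE (sketch): unpaired text-video pairs with very high similarity are
    # treated as pseudo-positives and excluded from the negative set.
    sim = t @ key.T                                              # (B, B)
    pseudo = sim > pseudo_thr
    pseudo.fill_diagonal_(False)

    # IRM (sketch): the moments least relevant to their paired query serve as
    # additional hard negatives for that query.
    red_idx = intra_sim.topk(num_redundant, dim=1, largest=False).indices
    redundant = torch.gather(m, 1, red_idx.unsqueeze(-1).expand(-1, -1, m.size(-1)))
    hard = torch.einsum('bd,bkd->bk', t, redundant)              # (B, k)

    logits = torch.cat([(sim / tau).masked_fill(pseudo, float('-inf')), hard / tau], dim=1)
    return F.cross_entropy(logits, torch.arange(t.size(0)))

def tcp_order_loss(frames, order_head):
    """TCP (sketch): shuffle frames and train a head to recover each frame's
    original index; order_head maps (B, T, D) -> (B, T, T) position logits."""
    B, T, D = frames.shape
    perm = torch.stack([torch.randperm(T) for _ in range(B)])            # (B, T)
    shuffled = torch.gather(frames, 1, perm.unsqueeze(-1).expand(-1, -1, D))
    logits = order_head(shuffled)                                        # (B, T, T)
    return F.cross_entropy(logits.reshape(B * T, T), perm.reshape(B * T))
```

The actual method may use different similarity measures, mining rules, and loss weights; the sketch only indicates where inter-sample correlation and intra-sample redundancy could enter training.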
Related papers
- Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition [53.02634128715853]
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR. It unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view.
arXiv Detail & Related papers (2025-04-14T10:23:22Z) - VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration
Existing multi-sensor fusion research predominantly integrates complementary information from multiple images rather than videos. VideoFusion exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics (CDE) and the Temporal Learner (TRL). We design a multi-stage learning strategy to optimize the model, including pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z) - Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval [0.17590081165362778]
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities.
Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences.
We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information.
arXiv Detail & Related papers (2023-12-12T17:00:46Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently across augmentations of the video.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
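For intuition only, here is a minimal sketch of what a transform-equivariant consistency term can look like: boundaries predicted on a temporally cropped clip should agree with the original-clip prediction mapped into the cropped coordinates. The `grounder` model, the crop transform, and the loss are illustrative assumptions, not this paper's implementation.

```python
# Illustrative sketch of an equivariant consistency loss; `grounder` is a
# hypothetical model returning normalized (start, end) boundaries in [0, 1].
import torch
import torch.nn.functional as F

def consistency_loss(grounder, video, query, crop=(0.1, 0.9)):
    """Compare boundaries predicted on a temporally cropped clip with the
    original prediction mapped into the cropped coordinate system."""
    s, e = crop
    T = video.size(1)                                   # video: (B, T, D)
    cropped = video[:, int(s * T):int(e * T)]

    pred_full = grounder(video, query)                  # (B, 2), normalized
    pred_crop = grounder(cropped, query)                # (B, 2), normalized

    # Map the full-clip prediction into the cropped clip's coordinates.
    mapped = ((pred_full - s) / (e - s)).clamp(0.0, 1.0)
    return F.l1_loss(pred_crop, mapped)
```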
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantically related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z) - Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network that takes as input proposals with multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model that predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)