Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
- URL: http://arxiv.org/abs/2601.12768v1
- Date: Mon, 19 Jan 2026 06:55:33 GMT
- Title: Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
- Authors: Zequn Xie, Boyun Zhang, Yuxiao Lin, Tao Jin
- Abstract summary: Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features. We introduce HVP-Net, a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder.
- Score: 9.243219818283263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our code is available at https://github.com/boyun-zhang/HVP-Net.
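The repository above is the authoritative implementation. As a rough, self-contained PyTorch sketch of the idea the abstract describes (tapping several intermediate encoder layers and keeping only the most salient patch tokens per level before fusing them into one video embedding), the snippet below uses hypothetical names; the choice of tap layers, the cosine-salience heuristic, and the fusion head are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPerceiver(nn.Module):
    """Pool patch tokens from several intermediate transformer layers,
    keep only the most salient patches per level, then fuse the levels."""

    def __init__(self, dim=512, depth=12, tap_layers=(3, 7, 11), keep_ratio=0.25):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)])
        self.tap_layers = set(tap_layers)  # which layers to tap (an assumption)
        self.keep_ratio = keep_ratio
        self.fuse = nn.Linear(dim * len(tap_layers), dim)

    def forward(self, patch_tokens):  # (B, N, dim): patch tokens of all frames
        level_feats, x = [], patch_tokens
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.tap_layers:
                # Salience stand-in: keep patches most aligned with the mean token.
                anchor = x.mean(dim=1, keepdim=True)          # (B, 1, dim)
                sal = F.cosine_similarity(x, anchor, dim=-1)  # (B, N)
                k = max(1, int(self.keep_ratio * x.size(1)))
                idx = sal.topk(k, dim=1).indices
                picked = torch.gather(
                    x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
                level_feats.append(picked.mean(dim=1))        # (B, dim)
        return self.fuse(torch.cat(level_feats, dim=-1))      # (B, dim)

# Usage: 2 videos, 8 frames of 49 patches each, embedding dim 512.
video_emb = HierarchicalPerceiver()(torch.randn(2, 8 * 49, 512))
print(video_emb.shape)  # torch.Size([2, 512])
```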
Related papers
- PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval [9.493866391853723]
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs). We introduce PREGEN, an efficient and powerful CoVR framework that overcomes these limitations.
arXiv Detail & Related papers (2026-01-20T09:57:04Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed, long in duration, and contain much more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
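The entry above describes distilling generalization knowledge from a large vision-language teacher. A minimal, generic sketch of such soft-target distillation is shown below (hypothetical function and tensor names; the paper's actual objective, with its dynamic distillation and soft alignment, is richer than this):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_sim, teacher_sim, tau=2.0):
    """Match the student's text-to-video similarity distribution to a
    frozen teacher's. A generic sketch, not the paper's exact objective."""
    t = F.softmax(teacher_sim / tau, dim=-1)                 # teacher soft targets
    s = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2  # standard KD scaling

# Usage: similarities between 4 text queries and 16 video candidates.
teacher = torch.randn(4, 16)  # e.g., from a frozen CLIP-style teacher
student = torch.randn(4, 16, requires_grad=True)
distillation_loss(student, teacher).backward()
```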
- LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts [19.81035705650859]
We introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset.
arXiv Detail & Related papers (2025-05-20T04:49:09Z)
- Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for a range of downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z)
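The Perception Encoder entry attributes its embeddings to contrastive vision-language training. The standard symmetric InfoNCE objective behind that kind of training can be sketched as follows (a textbook formulation, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text pairs."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(len(logits))   # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with a random batch of 8 pairs of 512-d embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```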
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
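HiTVideo's summary mentions hierarchically structured codebooks. One minimal way to realize a hierarchy of discrete codes is residual quantization, sketched below (illustrative only; the straight-through gradient estimator and the 3D causal VAE around it are omitted, and all names are assumptions):

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Each level's codebook encodes the residual left by coarser levels,
    giving a coarse-to-fine hierarchy of discrete tokens."""

    def __init__(self, dim=256, codebook_size=512, levels=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(levels)])

    def forward(self, z):  # (B, T, dim) latents
        residual, quantized, codes = z, 0, []
        for book in self.codebooks:
            # Squared distance from each latent to every code: (B, T, K).
            d = (residual.unsqueeze(2) - book.weight).pow(2).sum(-1)
            idx = d.argmin(dim=-1)      # nearest code id per latent
            q = book(idx)
            quantized = quantized + q
            residual = residual - q     # the finer level sees what is left
            codes.append(idx)
        return quantized, codes         # reconstruction + per-level token ids

# Usage: quantize 16 latent "frames" of dim 256 for 2 videos.
zq, codes = ResidualQuantizer()(torch.randn(2, 16, 256))
```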
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in Recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factors that lead to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
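TVTSv2's summary mentions retaining the text encoder's generalization ability during pre-training. One common reading, and purely an assumption here, is to keep the text tower frozen while the video side trains; a minimal sketch with a hypothetical `text_encoder` attribute:

```python
import torch.nn as nn

def freeze_text_tower(model: nn.Module, text_attr: str = "text_encoder"):
    """Freeze the text tower so video-side pre-training cannot degrade it.
    `text_attr` is a hypothetical attribute name for the text sub-module."""
    text_tower = getattr(model, text_attr)
    text_tower.eval()  # fix dropout / normalization statistics
    for p in text_tower.parameters():
        p.requires_grad = False
    # Hand only the still-trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: optimizer = torch.optim.AdamW(freeze_text_tower(model), lr=1e-4)
```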
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.