Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
- URL: http://arxiv.org/abs/2510.27571v1
- Date: Fri, 31 Oct 2025 15:54:48 GMT
- Title: Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
- Authors: Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
- Abstract summary: We introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB). Second, guided by UVRB's diagnostics, we introduce a scalable workflow that generates 1.55 million high-quality pairs. Third, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE).
- Score: 36.360760591731484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
Related papers
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
- MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
- COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets [25.82307075214309]
We propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features. This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios.
arXiv Detail & Related papers (2025-08-13T15:43:20Z)
- Team of One: Cracking Complex Video QA with Model Synergy [24.75732964829523]
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries.
arXiv Detail & Related papers (2025-07-18T11:12:44Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding [0.0]
We propose a framework that fuses video, image, and text encodings using GRU-based sequence encoders and cross-modal attention mechanisms. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
arXiv Detail & Related papers (2025-07-04T12:35:52Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared. Similar samples across different modalities have more knowledge to share than otherwise. We propose a method for RGB-X tracker during inference, with an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics [7.68184437595058]
We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets.
We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search.
arXiv Detail & Related papers (2023-12-12T17:06:39Z)
- Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of un-cropped scene images.
Existing person search applications are mostly trained and deployed in the same-origin scenarios.
We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.