Related papers: Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

URL: http://arxiv.org/abs/2510.27571v1
Date: Fri, 31 Oct 2025 15:54:48 GMT
Title: Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Authors: Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu,
Abstract summary: We introduce a framework built on the co-design of evaluation, data, and modeling.<n>First, we establish the Universal Video Retrieval Benchmark (UVRB)<n>Second, guided by UVRB's diagnostics, we introduce a scalable workflow that generates 1.55 million high-quality pairs.<n>Third, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE)
Score: 36.360760591731484
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

Related papers

RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis.<n>RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories.<n>We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.<n>Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios.<n>We introduce MR$2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets [25.82307075214309]
We propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME)<n>COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features.<n>This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios.
arXiv Detail & Related papers (2025-08-13T15:43:20Z)
Team of One: Cracking Complex Video QA with Model Synergy [24.75732964829523]
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios.<n>Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries.
arXiv Detail & Related papers (2025-07-18T11:12:44Z)
HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos.<n>Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios.<n>We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding [0.0]
We propose a framework that fuses video, image, and textcoding using GRU-based sequence encoders and cross-modal attention mechanisms.<n>Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
arXiv Detail & Related papers (2025-07-04T12:35:52Z)
VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios.<n>We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared.<n>Similar samples across different modalities have more knowledge to share than otherwise.<n>We propose a method for RGB-X tracker during inference, with an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics [7.68184437595058]
We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search.
arXiv Detail & Related papers (2023-12-12T17:06:39Z)
Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.