Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
- URL: http://arxiv.org/abs/2507.15504v2
- Date: Thu, 24 Jul 2025 10:32:09 GMT
- Title: Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
- Authors: Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang
- Abstract summary: UMIVR is an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework. It quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics. It iteratively refines user queries, significantly reducing retrieval ambiguity.
- Score: 17.763377515783155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: a semantic entropy-based Text Ambiguity Score (TAS), a Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
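The abstract names two training-free uncertainty scores, the semantic entropy-based TAS and the Jensen-Shannon divergence-based MUS. As a rough illustration only (the paper's exact formulation is not reproduced here), the sketch below computes a semantic-entropy-style ambiguity score over meaning clusters of query paraphrases and a JS-divergence-based mapping uncertainty between two retrieval distributions; all function names and toy inputs are assumptions.

```python
import numpy as np

def semantic_entropy(cluster_probs):
    """Entropy over semantic clusters of sampled query paraphrases.

    cluster_probs: probability mass assigned to each meaning cluster
    (e.g., fraction of sampled paraphrases falling into that cluster).
    Higher entropy -> more ambiguous query (TAS-style score)."""
    p = np.asarray(cluster_probs, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two retrieval distributions.

    p, q: softmax-normalized similarity distributions over the video
    gallery obtained from two views of the query (e.g., before/after a
    clarification round). Higher JSD -> less stable text-video mapping
    (MUS-style score)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy usage: an ambiguous query spreads mass over many meaning clusters,
# and its two retrieval distributions disagree noticeably.
tas = semantic_entropy([0.4, 0.3, 0.2, 0.1])
mus = js_divergence([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
print(f"TAS ~ {tas:.3f}  MUS ~ {mus:.3f}")
```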
Related papers
- SUGAR: Leveraging Contextual Confidence for Smarter Retrieval [28.552283701883766]
We introduce Semantic Uncertainty Guided Adaptive Retrieval (SUGAR), which leverages context-based entropy to actively decide whether to retrieve and, further, whether to use single-step or multi-step retrieval. Empirical results show that selective retrieval guided by semantic uncertainty estimation improves performance across diverse question answering tasks and yields more efficient inference.
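As a rough illustration of this entropy-gated decision (not SUGAR's actual code), one might threshold the entropy of the model's answer distribution; the thresholds and function names below are placeholder assumptions.

```python
import math

def answer_entropy(probs):
    """Shannon entropy of an answer (or next-token) distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def retrieval_mode(answer_probs, low=0.5, high=1.5):
    """Illustrative gating: confident -> no retrieval, mildly uncertain ->
    single-step retrieval, highly uncertain -> multi-step retrieval.
    The thresholds are arbitrary placeholders, not values from the paper."""
    h = answer_entropy(answer_probs)
    if h < low:
        return "no_retrieval"
    elif h < high:
        return "single_step"
    return "multi_step"

print(retrieval_mode([0.9, 0.05, 0.05]))     # confident answer
print(retrieval_mode([0.4, 0.3, 0.2, 0.1]))  # uncertain answer
```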
arXiv Detail & Related papers (2025-01-09T01:24:59Z) - RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs [29.832360523402592]
We introduce RACQUET, a dataset targeting distinct aspects of ambiguity in image-based question answering. We reveal significant limitations and overconfidence in how state-of-the-art large multimodal language models address ambiguity in their responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.
arXiv Detail & Related papers (2024-12-18T13:25:11Z) - Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings [11.33157177182775]
Accurately quantifying uncertainty in large language models (LLMs) is crucial for their reliable deployment.
Current state-of-the-art methods for measuring semantic uncertainty in LLMs rely on strict bidirectional entailment criteria.
We propose a novel approach that leverages semantic embeddings to achieve smoother and more robust estimation of semantic uncertainty.
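A minimal sketch of the embedding-based idea, assuming a generic sentence encoder rather than the paper's exact estimator: sample several responses to the same prompt, embed them, and use their average pairwise cosine dissimilarity as a smooth uncertainty signal.

```python
import numpy as np

def embedding_uncertainty(embeddings):
    """Average pairwise cosine dissimilarity among sampled responses.

    embeddings: (n, d) array of sentence embeddings for n responses
    sampled from the LLM for the same prompt (e.g., from a
    Sentence-BERT-style encoder). Values near 0 mean the samples agree
    semantically; larger values indicate higher semantic uncertainty."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())

# Toy usage with hand-made 3-D "embeddings": the first set is tightly
# clustered (low uncertainty), the second is spread out (high uncertainty).
agree = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]])
spread = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(embedding_uncertainty(agree), embedding_uncertainty(spread))
```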
arXiv Detail & Related papers (2024-10-30T04:41:46Z) - Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems.
We also introduce a new metric, confidence-weighted accuracy, which is well correlated with both accuracy and calibration error.
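The exact metric is defined in the paper; one plausible reading, shown here only as an assumed illustration, weights each item's correctness by the model's reported confidence.

```python
def confidence_weighted_accuracy(correct, confidence):
    """Illustrative confidence-weighted accuracy (assumed form, not
    necessarily the paper's exact definition): each prediction's
    correctness (0/1) is weighted by the model's reported confidence,
    rewarding confident correct answers and penalizing confident errors."""
    num = sum(c * w for c, w in zip(correct, confidence))
    den = sum(confidence)
    return num / den if den > 0 else 0.0

# Two models with identical plain accuracy (2/3) but different calibration.
print(confidence_weighted_accuracy([1, 1, 0], [0.9, 0.8, 0.1]))  # well calibrated
print(confidence_weighted_accuracy([1, 1, 0], [0.3, 0.4, 0.9]))  # confident when wrong
```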
arXiv Detail & Related papers (2024-07-02T04:23:54Z) - GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval [60.70901959953688]
We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve the strong GMMFormer baseline with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
arXiv Detail & Related papers (2024-05-22T16:55:31Z) - Uncertainty-boosted Robust Video Activity Anticipation [72.14155465769201]
Video activity anticipation aims to predict what will happen in the future, embracing a broad application prospect ranging from robot vision to autonomous driving.
Despite the recent progress, the data uncertainty issue, reflected in the content evolution process and the dynamic correlations among event labels, has been largely ignored.
We propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results.
arXiv Detail & Related papers (2024-04-29T12:31:38Z) - Echoes of Socratic Doubt: Embracing Uncertainty in Calibrated Evidential Reinforcement Learning [1.7898305876314982]
The proposed algorithm combines deep evidential learning with quantile calibration based on principles of conformal inference.
It is tested on a suite of miniaturized Atari games (i.e., MinAtar).
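The conformal quantile-calibration step referenced above can be sketched generically as follows; the nonconformity scores and names are illustrative, not the paper's implementation.

```python
import numpy as np

def conformal_quantile(calib_scores, alpha=0.1):
    """Split-conformal calibration step: given nonconformity scores on a
    held-out calibration set, return the threshold that covers at least
    (1 - alpha) of future scores under exchangeability."""
    s = np.sort(np.asarray(calib_scores, dtype=float))
    n = len(s)
    # finite-sample corrected quantile level, clipped to 1.0
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(s, level, method="higher"))

# Toy usage: absolute errors of a value estimator on a calibration split.
rng = np.random.default_rng(0)
errors = np.abs(rng.normal(size=500))
q = conformal_quantile(errors, alpha=0.1)
print(f"90% conformal threshold ~ {q:.2f}")
```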
arXiv Detail & Related papers (2024-02-11T05:17:56Z) - Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering [63.12469700986452]
We introduce the concept of uncertainty-aware curriculum learning (CL).
Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty.
In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments.
arXiv Detail & Related papers (2024-01-03T02:29:34Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment [54.31355080688127]
We introduce a text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP).
BVQI-Local demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets.
We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
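A minimal sketch of the text-prompted idea, using Hugging Face's CLIP interface; the prompt wording and pooling here are assumptions, not the exact SAQI/BVQI formulation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score a frame by how much closer it sits to a "positive" quality prompt
# than to a "negative" one in CLIP space (illustrative only).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a high quality, sharp photo", "a low quality, blurry photo"]

def semantic_affinity_score(frame: Image.Image) -> float:
    inputs = processor(text=prompts, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2)
    probs = logits.softmax(dim=-1)[0]
    return float(probs[0])  # mass on the "good quality" prompt

# Usage (the file path is a placeholder):
# score = semantic_affinity_score(Image.open("frame_0001.jpg"))
```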
arXiv Detail & Related papers (2023-04-28T08:06:05Z) - Uncertain Facial Expression Recognition via Multi-task Assisted Correction [43.02119884581332]
We propose MTAC, a novel multi-task assisted correction method for uncertain facial expression recognition.
Specifically, a confidence estimation block and a weighted regularization module are applied to highlight solid samples and suppress uncertain samples in every batch.
Experiments on the RAF-DB, AffectNet, and AffWild2 datasets demonstrate that MTAC obtains substantial improvements over baselines when facing synthetic and real uncertainties.
arXiv Detail & Related papers (2022-12-14T10:28:08Z)