Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
- URL: http://arxiv.org/abs/2510.19622v1
- Date: Wed, 22 Oct 2025 14:19:38 GMT
- Title: Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
- Authors: Zhengxuan Wei, Jiajin Tang, Sibei Yang
- Abstract summary: We propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, to overcome local optima. AMR resolves ambiguous boundary information and semantic confusion in existing annotations without additional data. AMR achieves improved performance over prior state-of-the-art approaches.
- Score: 33.16156949633519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
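The distillation-adaptation stage described in the abstract lends itself to a short sketch. The PyTorch-style code below is a minimal illustration, not the authors' implementation: the class name DualQueryMomentRetriever, the two-layer stand-in decoder, the MSE form of the consistency term, and the 0.5 loss weight are all assumptions made for clarity. It only shows the mechanics the abstract names: Base Queries frozen from the cold-start model, trainable Original and Active Queries, and a cross-stage distillation loss tying Original-Query outputs to the frozen Base-Query outputs.

```python
# Minimal PyTorch-style sketch of the dual-query distillation stage (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualQueryMomentRetriever(nn.Module):
    """Toy DETR-style retriever with frozen Base, trainable Original, and Active queries."""

    def __init__(self, cold_start_queries: torch.Tensor, hidden_dim: int = 256):
        super().__init__()
        # Base Queries: frozen copy of the cold-start model's learned queries.
        self.register_buffer("base_queries", cold_start_queries.clone())
        # Original Queries: trainable, initialized from the cold-start queries and kept
        # consistent with the frozen Base Queries via the cross-stage distillation loss.
        self.original_queries = nn.Parameter(cold_start_queries.clone())
        # Active Queries: trainable from scratch, free to adapt to real-data distributions.
        num_queries = cold_start_queries.size(0)
        self.active_queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Stand-in for the full DETR decoder over video features.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.span_head = nn.Linear(hidden_dim, 2)  # (center, width) per query

    def decode(self, queries: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Broadcast the shared queries across the batch and attend over video features.
        q = queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        return self.decoder(q, video_feats)

    def forward(self, video_feats: torch.Tensor):
        hs_original = self.decode(self.original_queries, video_feats)
        hs_active = self.decode(self.active_queries, video_feats)
        with torch.no_grad():  # Base Queries stay frozen; no gradients flow through them.
            hs_base = self.decode(self.base_queries, video_feats)
        spans = self.span_head(torch.cat([hs_original, hs_active], dim=1))
        # Cross-stage distillation: keep Original Queries consistent with the frozen Base
        # Queries so boundary/semantic awareness from the cold-start stage is not forgotten.
        distill_loss = F.mse_loss(hs_original, hs_base)
        return spans, distill_loss


# Usage: add the distillation term to the usual DETR-style localization objective.
cold_start_queries = torch.randn(10, 256)   # pretend these come from the cold-start model
model = DualQueryMomentRetriever(cold_start_queries)
video_feats = torch.randn(4, 75, 256)       # batch of 4 clips, 75 frame features each
spans, distill_loss = model(video_feats)
loc_loss = spans.abs().mean()               # placeholder for the real matching/L1/gIoU loss
total_loss = loc_loss + 0.5 * distill_loss  # 0.5 is an arbitrary illustrative weight
total_loss.backward()
```

The consistency term could just as well be applied to predicted spans or attention maps rather than decoder hidden states; the abstract does not specify, so the MSE over hidden states here is only one plausible reading.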
Related papers
- GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement [24.929199892659636]
Temporal Forgery Localization aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. Most existing TFL methods rely on dense frame-level labels in a fully supervised manner, but Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. We propose GEM-TFL, a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference.
arXiv Detail & Related papers (2026-03-05T12:07:26Z) - Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation [73.32435804067883]
Generalizable Knowledge Distillation (GKD) is a multi-stage framework that explicitly enhances generalization. Experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods.
arXiv Detail & Related papers (2026-03-03T03:18:12Z) - Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) a Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment.
arXiv Detail & Related papers (2026-02-03T06:06:35Z) - DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities [28.992992584085787]
DIS2 is a new paradigm that shifts from dependence on modality-shared features to active, guided compensation of missing features. Compensatory features are explicitly captured and, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case. Our proposed approach significantly outperforms state-of-the-art methods across benchmarks.
arXiv Detail & Related papers (2026-01-20T01:33:54Z) - Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. We also introduce SLIM, the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - DE3S: Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification [11.539700200482853]
Early Time-Series Classification (ETSC) is critical in time-sensitive medical applications such as sepsis detection. It presents an inherent trade-off between accuracy and earliness. We propose DE3S, a framework to overcome these underlying challenges.
arXiv Detail & Related papers (2025-10-14T07:10:05Z) - Semantic-Inductive Attribute Selection for Zero-Shot Learning [4.083977531653519]
We study two complementary feature-selection strategies for Zero-Shot Learning (ZSL). The first adapts embedded feature selection to the demands of ZSL, turning model-driven rankings into meaningful semantic pruning. The second leverages evolutionary computation to directly explore the space of attribute subsets more broadly.
arXiv Detail & Related papers (2025-09-26T15:14:36Z) - Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable. We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
arXiv Detail & Related papers (2025-07-13T05:37:33Z) - Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation [6.7864586321550595]
We introduce a novel score that measures the forgetting rate of deep models on validation data. We demonstrate that local overfitting can arise even without conventional overfitting. We then introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge.
arXiv Detail & Related papers (2025-07-11T15:37:24Z) - What Makes Local Updates Effective: The Role of Data Heterogeneity and Smoothness [5.357435119431715]
The thesis contributes a self-contained guide for analyzing Local SGD in heterogeneous environments. It also extends to online learning, providing fundamental bounds under both first-order and bandit feedback.
arXiv Detail & Related papers (2025-06-30T19:06:02Z) - Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models [7.566515311806724]
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss.
arXiv Detail & Related papers (2025-06-05T17:55:23Z) - Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation [82.39763984380625]
We propose Restoration Score Distillation (RSD), a principled generalization of Denoising Score Distillation (DSD). RSD accommodates a broader range of corruption types, such as blurred, incomplete, or low-resolution images. It consistently surpasses its teacher model across diverse restoration tasks on both natural and scientific datasets.
arXiv Detail & Related papers (2025-05-19T17:21:03Z) - E2ED^2: Direct Mapping from Noise to Data for Enhanced Diffusion Models [15.270657838960114]
Diffusion models have established themselves as the de facto primary paradigm in visual generative modeling. We present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples to initial noises. Our method achieves substantial performance gains in terms of Fréchet Inception Distance (FID) and CLIP score, even with fewer sampling steps.
arXiv Detail & Related papers (2024-12-30T16:06:31Z) - Breaking Determinism: Fuzzy Modeling of Sequential Recommendation Using Discrete State Space Diffusion Model [66.91323540178739]
Sequential recommendation (SR) aims to predict items that users may be interested in based on their historical behavior.
We revisit SR from a novel information-theoretic perspective and find that sequential modeling methods fail to adequately capture the randomness and unpredictability of user behavior.
Inspired by fuzzy information processing theory, this paper introduces the fuzzy sets of interaction sequences to overcome the limitations and better capture the evolution of users' real interests.
arXiv Detail & Related papers (2024-10-31T14:52:01Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)