MomentDiff: Generative Video Moment Retrieval from Random to Real
- URL: http://arxiv.org/abs/2307.02869v2
- Date: Wed, 11 Oct 2023 10:03:08 GMT
- Title: MomentDiff: Generative Video Moment Retrieval from Random to Real
- Authors: Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun
Zheng, Deli Zhao, Yongdong Zhang
- Abstract summary: We provide a generative diffusion-based framework called MomentDiff.
MomentDiff simulates a typical human retrieval process from random browsing to gradual localization.
We show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks.
- Score: 71.40038773943638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval pursues an efficient and generalized solution to
identify the specific temporal segments within an untrimmed video that
correspond to a given language description. To achieve this goal, we provide a
generative diffusion-based framework called MomentDiff, which simulates a
typical human retrieval process from random browsing to gradual localization.
Specifically, we first diffuse the real span to random noise, and learn to
denoise the random noise to the original span with the guidance of similarity
between text and video. This allows the model to learn a mapping from arbitrary
random locations to real moments, so that it can localize segments from
random initialization. Once trained, MomentDiff can sample random temporal
segments as initial guesses and iteratively refine them to generate an accurate
temporal boundary. Different from discriminative works (e.g., based on
learnable proposals or queries), MomentDiff with random initialized spans could
resist the temporal location biases from datasets. To evaluate the influence of
the temporal location biases, we propose two anti-bias datasets with location
distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The
experimental results demonstrate that our efficient framework consistently
outperforms state-of-the-art methods on three public benchmarks, and exhibits
better generalization and robustness on the proposed anti-bias datasets. The
code, model, and anti-bias evaluation datasets are available at
https://github.com/IMCCretrieval/MomentDiff.
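To make the random-to-real process described in the abstract concrete, the sketch below shows a DDPM-style forward corruption of a ground-truth span and an iterative denoising loop that starts from a random span. This is a minimal illustrative sketch, not the released implementation: the step count, linear noise schedule, (center, width) span parameterization, and the `eps_model` placeholder for the text-video-conditioned denoiser are all assumptions; the actual model and training code are in the linked repository.

```python
# Illustrative sketch of span diffusion (not the authors' code).
# A ground-truth span x0 = (center, width), normalized to [0, 1], is diffused to noise
# during training; at inference, a random span is iteratively denoised. `eps_model` is a
# hypothetical placeholder for MomentDiff's text-video-conditioned denoiser network.
import numpy as np

T = 50                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)       # simple linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(span, t, rng):
    """Forward diffusion: corrupt a real span at step t; the noise is the training target."""
    noise = rng.standard_normal(2)
    x_t = np.sqrt(alpha_bars[t]) * span + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

def p_sample_loop(eps_model, video_feats, text_feats, rng):
    """Reverse process: start from a random span and iteratively refine it."""
    x_t = rng.standard_normal(2)         # random initial guess, as in the abstract
    for t in reversed(range(T)):
        eps = eps_model(x_t, t, video_feats, text_feats)   # predicted noise (placeholder)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x_t - coef * eps) / np.sqrt(alphas[t])
        x_t = mean + (np.sqrt(betas[t]) * rng.standard_normal(2) if t > 0 else 0.0)
    center, width = np.clip(x_t, 0.0, 1.0)
    return max(0.0, center - width / 2), min(1.0, center + width / 2)  # (start, end)
```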
Related papers
- PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
Motivated by increasing privacy concerns, we propose a Parameter-Efficient Federated Anomaly Detection framework named PeFAD.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z) - Random Boxes Are Open-world Object Detectors [71.86454597677387]
We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD) performance.
We propose RandBox, a Fast R-CNN-based architecture trained on random proposals at each training iteration.
RandBox significantly outperforms the previous state-of-the-art in all metrics.
arXiv Detail & Related papers (2023-07-17T05:08:32Z) - Sample and Predict Your Latent: Modality-free Sequential Disentanglement
via Contrastive Estimation [2.7759072740347017]
We introduce a self-supervised sequential disentanglement framework based on contrastive estimation with no external signals.
In practice, we propose a unified, efficient, and easy-to-code sampling strategy for semantically similar and dissimilar views of the data.
Our method achieves state-of-the-art results in comparison to existing techniques.
arXiv Detail & Related papers (2023-05-25T10:50:30Z) - DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Learning Sample Importance for Cross-Scenario Video Temporal Grounding [30.82619216537177]
The paper investigates some superficial biases specific to the temporal grounding task.
We propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases.
We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced.
arXiv Detail & Related papers (2022-01-08T15:41:38Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - AriEL: volume coding for sentence generation [5.972927416266617]
We improve on the performance of standard deep learning methods for generating sentences by uniformly sampling a continuous space.
We do so by proposing AriEL, which constructs volumes in a continuous space without needing to encourage the creation of volumes through the loss function.
Our results indicate that random access to the stored information is dramatically improved, and AriEL is able to generate a wider variety of correct language by randomly sampling the latent space.
arXiv Detail & Related papers (2020-03-30T16:30:47Z)