DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and
Highlight Detection
- URL: http://arxiv.org/abs/2308.15109v2
- Date: Sat, 2 Mar 2024 12:34:42 GMT
- Title: DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and
Highlight Detection
- Authors: Henghao Zhao, Kevin Qinghong Lin, Rui Yan and Zechao Li
- Abstract summary: A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
- Score: 38.12212015133935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval and highlight detection have received attention in the
current era of video content proliferation, aiming to localize moments and
estimate clip relevance based on user-specific queries. Because video content is continuous in time, temporal events in a video often lack clear boundaries. This boundary ambiguity makes it challenging for a model to learn text-video clip correspondences, and it underlies the subpar performance of existing methods in predicting target segments. To
alleviate this problem, we propose to solve the two tasks jointly from the
perspective of denoising generation, so that target boundaries can be localized clearly through coarse-to-fine iterative refinement. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process built on a diffusion model. During training, Gaussian noise is added to corrupt the ground truth, producing noisy candidates as input, and the model is trained to reverse this noising process. At inference, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals into meaningful output. Notably, DiffusionVMR inherits a key advantage of diffusion models: results can be refined iteratively during inference, sharpening boundary predictions from coarse to fine. Furthermore, training and inference in DiffusionVMR are decoupled; settings such as the number of denoising steps can be chosen freely at inference without matching the training phase. Extensive experiments conducted on five
widely-used benchmarks (i.e., QVHighlight, Charades-STA, TACoS,
YouTubeHighlights and TVSum) across two tasks (moment retrieval and/or
highlight detection) demonstrate the effectiveness and flexibility of the
proposed DiffusionVMR.
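To make the training and inference procedure described above concrete, below is a minimal PyTorch sketch of a conditional span-denoising setup in this spirit. It is an illustration under stated assumptions, not the paper's implementation: the (center, width) span parameterization, the `SpanDenoiser` module, the linear noise schedule, and the DDIM-style sampler are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # diffusion steps used at training
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class SpanDenoiser(nn.Module):
    """Toy denoiser: refines noisy (center, width) spans given fused
    video-text context features."""
    def __init__(self, ctx_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + ctx_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, noisy_spans, ctx, t):
        # noisy_spans: (B, N, 2); ctx: (B, ctx_dim); t: (B,)
        B, N, _ = noisy_spans.shape
        cond = torch.cat([
            noisy_spans,
            ctx[:, None, :].expand(B, N, ctx.size(-1)),
            (t.float() / T)[:, None, None].expand(B, N, 1),
        ], dim=-1)
        return self.net(cond)              # predicted clean spans, (B, N, 2)

def train_step(model, gt_spans, ctx):
    """Corrupt ground-truth spans with Gaussian noise; learn to reverse it."""
    B = gt_spans.size(0)
    t = torch.randint(0, T, (B,))
    eps = torch.randn_like(gt_spans)
    ab = alpha_bar[t][:, None, None]
    noisy = ab.sqrt() * gt_spans + (1 - ab).sqrt() * eps  # forward corruption
    pred = model(noisy, ctx, t)
    return F.l1_loss(pred, gt_spans)

@torch.no_grad()
def infer(model, ctx, n_proposals=10, steps=50):
    """Start from pure Gaussian noise and refine proposals coarse-to-fine."""
    B = ctx.size(0)
    spans = torch.randn(B, n_proposals, 2)
    ts = torch.linspace(T - 1, 0, steps).long()
    for i, t in enumerate(ts):
        x0 = model(spans, ctx, t.expand(B))               # predict clean spans
        if i + 1 < len(ts):                               # DDIM-style update
            ab_t = alpha_bar[t]
            ab_next = alpha_bar[ts[i + 1]]
            eps = (spans - ab_t.sqrt() * x0) / (1 - ab_t).sqrt().clamp_min(1e-8)
            spans = ab_next.sqrt() * x0 + (1 - ab_next).sqrt() * eps
        else:
            spans = x0
    return spans.clamp(0.0, 1.0)
```

The `infer` loop illustrates the decoupling noted in the abstract: `steps` can be set freely at inference, independent of the `T`-step schedule used for training.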
Related papers
- Diffusion-TS: Interpretable Diffusion for General Time Series Generation [6.639630994040322]
Diffusion-TS is a novel diffusion-based framework that generates time series samples of high quality.
The model is trained to reconstruct the sample directly, rather than the noise, at each diffusion step, combined with a Fourier-based loss term (sketched below).
Results show that Diffusion-TS achieves state-of-the-art performance on various realistic analyses of time series.
arXiv Detail & Related papers (2024-03-04T05:39:23Z)
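As a reading of the two design choices above (sample prediction and a Fourier-based loss), here is a hypothetical loss function; the name `diffusion_ts_loss`, the `fourier_weight` knob, and the exact magnitude-matching form of the frequency term are assumptions, not taken from the paper.

```python
# Hypothetical sketch of Diffusion-TS-style training: the network
# predicts the clean sample x0 (not the noise), with an extra
# Fourier-domain term matching FFT magnitudes along the time axis.
import torch
import torch.nn.functional as F

def diffusion_ts_loss(model, x0, t, alpha_bar, fourier_weight=0.1):
    """x0: clean series, shape (B, L, C); t: timesteps, shape (B,)."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t][:, None, None]
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward corruption
    x0_hat = model(x_t, t)                        # direct sample reconstruction
    time_loss = F.mse_loss(x0_hat, x0)
    freq_loss = F.mse_loss(torch.fft.rfft(x0_hat, dim=1).abs(),
                           torch.fft.rfft(x0, dim=1).abs())
    return time_loss + fourier_weight * freq_loss
```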
- Exploring Iterative Refinement with Diffusion Models for Video Grounding [17.435735275438923]
Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query.
We propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task.
arXiv Detail & Related papers (2023-10-26T07:04:44Z)
- Single and Few-step Diffusion for Generative Speech Enhancement [18.487296462927034]
Diffusion models have shown promising results in speech enhancement but typically require many denoising steps at inference.
In this paper, we address these limitations through a two-stage training approach.
We show that our proposed method maintains steady performance and therefore largely outperforms the diffusion baseline in this setting.
arXiv Detail & Related papers (2023-09-18T11:30:58Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking random temporal proposals as input, it can accurately yield action proposals given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
This work instead tackles the task from a generative viewpoint, modeling the correlation between text and video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models retrieval as a process of gradually generating the joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z)
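A speculative sketch of one way to realize this generative view: treat the vector of candidate relevance scores as the object being denoised, conditioned on the query. The `RelevanceDenoiser` module and all names below are illustrative, not from DiffusionRet.

```python
# Speculative reading of the blurb, not the authors' code: denoise a
# vector of candidate relevance scores from Gaussian noise, conditioned
# on a query embedding, so retrieval becomes generation of scores.
import torch
import torch.nn as nn

class RelevanceDenoiser(nn.Module):
    def __init__(self, n_candidates, q_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_candidates + q_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_candidates),
        )

    def forward(self, noisy_scores, query_emb, t_frac):
        # noisy_scores: (B, N); query_emb: (B, q_dim); t_frac: (B, 1)
        return self.net(torch.cat([noisy_scores, query_emb, t_frac], dim=-1))
```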
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process by resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that the approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
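The noise decomposition lends itself to a short sketch. Below is an illustrative version, assuming a mixing coefficient (`mix`, a hypothetical knob not taken from the paper) that trades off shared versus per-frame noise while keeping each frame's noise unit-variance.

```python
# Illustrative decomposition of per-frame noise into a shared base term
# plus a per-frame residual; `mix` is a hypothetical knob, not a
# parameter from the paper.
import torch

def decomposed_noise(batch, frames, ch, h, w, mix=0.7):
    base = torch.randn(batch, 1, ch, h, w)           # shared among all frames
    residual = torch.randn(batch, frames, ch, h, w)  # varies along time
    # Mixing so each frame's noise stays unit-variance Gaussian:
    # mix^2 + (1 - mix^2) = 1.
    return mix * base + (1.0 - mix ** 2) ** 0.5 * residual
```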
- Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [108.55320735031721]
Video moment retrieval aims to localize the target moment in a video according to the given sentence.
Most existing weakly-supervised methods apply an MIL-based framework to develop inter-sample confrontment.
We propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments.
arXiv Detail & Related papers (2020-08-19T04:42:46Z)