Exploring Iterative Refinement with Diffusion Models for Video Grounding
- URL: http://arxiv.org/abs/2310.17189v2
- Date: Fri, 29 Dec 2023 16:06:51 GMT
- Title: Exploring Iterative Refinement with Diffusion Models for Video Grounding
- Authors: Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang
- Abstract summary: Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query.
We propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task.
- Score: 17.435735275438923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video grounding aims to localize the target moment in an untrimmed video
corresponding to a given sentence query. Existing methods typically select the
best prediction from a set of predefined proposals or directly regress the
target span in a single-shot manner, resulting in the absence of a systematic
prediction refinement process. In this paper, we propose DiffusionVG, a novel
framework with diffusion models that formulates video grounding as a
conditional generation task, where the target span is generated from Gaussian
noise inputs and iteratively refined in the reverse diffusion process. During
training, DiffusionVG progressively adds noise to the target span with a fixed
forward diffusion process and learns to recover the target span in the reverse
diffusion process. In inference, DiffusionVG can generate the target span from
Gaussian noise inputs by the learned reverse diffusion process conditioned on
the video-sentence representations. Without bells and whistles, our DiffusionVG
demonstrates superior performance compared to existing well-crafted models on
mainstream Charades-STA, ActivityNet Captions and TACoS benchmarks.
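The training and inference procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the (center, width) span encoding, the cosine noise schedule, the deterministic DDIM-style update, and every function name below are assumptions chosen for illustration, and `oracle_denoiser` is a hypothetical placeholder for the learned network conditioned on video-sentence representations.

```python
import numpy as np


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (cosine schedule, assumed)."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def add_noise(span, t, alpha_bar, rng):
    """Fixed forward process: sample a noisy span x_t from the clean span x_0."""
    eps = rng.standard_normal(span.shape)
    return np.sqrt(alpha_bar[t]) * span + np.sqrt(1.0 - alpha_bar[t]) * eps


def oracle_denoiser(noisy_span, t, condition):
    """Hypothetical placeholder for the learned network.

    A real model would predict the clean span from the noisy span, the step t,
    and video-sentence features; this stand-in simply returns a stored target.
    """
    return condition["target_span"]


def refine_from_noise(denoiser, condition, alpha_bar, num_iters, rng):
    """Reverse process: start from Gaussian noise and refine the span step by step."""
    total = len(alpha_bar) - 1
    # Strictly decreasing step indices ending at 0 (requires num_iters <= total).
    steps = np.linspace(total, 0, num_iters + 1).round().astype(int)
    x = rng.standard_normal(2)  # random (center, width)
    for t, t_prev in zip(steps[:-1], steps[1:]):
        # Predict the clean span, then infer the noise that explains x at step t.
        x0_hat = np.clip(denoiser(x, t, condition), 0.0, 1.0)
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Deterministic (DDIM-style) move to the less noisy step t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)
target = np.array([0.40, 0.20])  # normalized (center, width) of the target moment
noisy = add_noise(target, 500, alpha_bar, rng)  # what the network would see in training
predicted = refine_from_noise(oracle_denoiser, {"target_span": target}, alpha_bar, 5, rng)
print("noisy span at t=500:", noisy)
print("refined span:", predicted)
```

In training, a network would be fit to recover `target` from samples such as `noisy`; at inference, only the refinement loop runs, with the real conditioned network in place of the placeholder.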
Related papers
- Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audiovisual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
- Improved off-policy training of diffusion samplers [93.66433483772055]
We study the problem of training diffusion models to sample from a distribution with an unnormalized density or energy function.
We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods.
Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work.
arXiv Detail & Related papers (2024-02-07T18:51:49Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Single and Few-step Diffusion for Generative Speech Enhancement [18.487296462927034]
Diffusion models have shown promising results in speech enhancement.
In this paper, we address these limitations through a two-stage training approach.
We show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting.
arXiv Detail & Related papers (2023-09-18T11:30:58Z)
- Diffusion-based 3D Object Detection with Random Boxes [58.43022365393569]
Existing anchor-based 3D detection methods rely on empirical settings of anchors, which makes the algorithms lack elegance.
Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets.
In the inference stage, the model progressively refines a set of random boxes to the prediction results.
arXiv Detail & Related papers (2023-09-05T08:49:53Z)
- DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)