DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
- URL: http://arxiv.org/abs/2303.09867v2
- Date: Sat, 19 Aug 2023 08:31:57 GMT
- Title: DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
- Authors: Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen
- Abstract summary: Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
- Score: 56.03464169048182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-video retrieval solutions are, in essence, discriminant models
focused on maximizing the conditional likelihood, i.e., p(candidates|query).
While straightforward, this de facto paradigm overlooks the underlying data
distribution p(query), which makes it challenging to identify
out-of-distribution data. To address this limitation, we creatively tackle this
task from a generative viewpoint and model the correlation between the text and
the video as their joint probability p(candidates,query). This is accomplished
through a diffusion-based text-video retrieval framework (DiffusionRet), which
models the retrieval task as a process of gradually generating joint
distribution from noise. During training, DiffusionRet is optimized from both
the generation and discrimination perspectives, with the generator being
optimized by generation loss and the feature extractor trained with contrastive
loss. In this way, DiffusionRet cleverly leverages the strengths of both
generative and discriminative methods. Extensive experiments on five commonly
used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD,
ActivityNet Captions, and DiDeMo, with superior performances, justify the
efficacy of our method. More encouragingly, without any modification,
DiffusionRet even performs well in out-domain retrieval settings. We believe
this work brings fundamental insights into the related fields. Code is
available at https://github.com/jpthu17/DiffusionRet.
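The two training signals named in the abstract, a contrastive loss for the feature extractor and a diffusion process that starts from noise, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, the symmetric InfoNCE form of the contrastive loss, and the standard DDPM-style noising of a one-hot relevance vector are all assumptions chosen for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE loss (assumed form): row i of each input is a matched pair."""
    texts = [normalize(t) for t in text_emb]
    videos = [normalize(v) for v in video_emb]
    n = len(texts)
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in videos]
        for t in texts
    ]
    # Text-to-video direction: each text should score its own video highest.
    t2v = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Video-to-text direction: same computation on the transposed logits.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    v2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2v + v2t)


def forward_noise(x0, alpha_bar, rng):
    """DDPM-style noising (assumed): x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


if __name__ == "__main__":
    rng = random.Random(0)
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs loss:", symmetric_contrastive_loss(aligned, aligned))
    print("mismatched pairs loss:", symmetric_contrastive_loss(aligned, swapped))
    # Hypothetical relevance of three candidate videos to one query (one-hot).
    relevance = [1.0, 0.0, 0.0]
    noisy, _ = forward_noise(relevance, alpha_bar=0.5, rng=rng)
    print("noised relevance:", noisy)
```

In this sketch the contrastive loss is small when matched text and video embeddings align and large when they are swapped, while `forward_noise` shows how a clean relevance vector is corrupted at a given noise level. A generator in the sense of the abstract would be trained to reverse that corruption; the exact generation loss used by DiffusionRet is not stated in the abstract and is not reproduced here.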
Related papers
- Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning [43.74071631716718]
We show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution.
We propose a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models.
arXiv Detail & Related papers (2024-07-29T15:36:42Z)
- Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models [68.73086826874733]
We introduce a novel Referring Diffusional segmentor (Ref-Diff) for referring image segmentation.
We demonstrate that without a proposal generator, a generative model alone can achieve comparable performance to existing SOTA weakly-supervised models.
This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation.
arXiv Detail & Related papers (2023-08-31T14:55:30Z)
- DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Variational Diffusion Auto-encoder: Latent Space Extraction from Pre-trained Diffusion Models [0.0]
Variational Auto-Encoders (VAEs) face challenges with the quality of generated images, often presenting noticeable blurriness.
This issue stems from the unrealistic assumption that approximates the conditional data distribution, $p(\textbf{x} | \textbf{z})$, as an isotropic Gaussian.
We illustrate how one can extract a latent space from a pre-existing diffusion model by optimizing an encoder to maximize the marginal data log-likelihood.
arXiv Detail & Related papers (2023-04-24T14:44:47Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
- DiffusionInst: Diffusion Model for Instance Segmentation [15.438504077368936]
DiffusionInst is a novel framework that represents instances as instance-aware filters.
It is trained to reverse the noisy ground truth without any inductive bias from RPN.
It achieves competitive performance compared to existing instance segmentation models.
arXiv Detail & Related papers (2022-12-06T05:52:12Z)
- Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs).
Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models.
Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.