VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
- URL: http://arxiv.org/abs/2503.10678v1
- Date: Tue, 11 Mar 2025 06:12:35 GMT
- Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
- Authors: Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
- Abstract summary: We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models. We introduce a large-scale video referring matting dataset with 10,000 videos.
- Score: 9.465414294387507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
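The abstract names a Latent-Constructive loss for separating instances but does not give its form. Below is a minimal, hypothetical PyTorch sketch of an instance-contrastive latent loss in that spirit; the function name, the `temperature` parameter, and the InfoNCE-style formulation are assumptions for illustration, not the released VRMDiff code (see the repository above for the actual implementation).

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(latents: torch.Tensor,
                              instance_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical instance-contrastive loss over pooled diffusion latents.

    latents:      (N, D) one pooled latent per sampled clip/instance.
    instance_ids: (N,)   integer id of the referred instance for each latent.
    Latents of the same instance are pulled together and latents of different
    instances pushed apart (InfoNCE-style); the paper's exact loss may differ.
    """
    z = F.normalize(latents, dim=-1)              # cosine-similarity space
    sim = z @ z.t() / temperature                 # (N, N) similarity logits
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (instance_ids[:, None] == instance_ids[None, :]) & ~eye

    # Softmax over all other samples, then average log-probability of positives.
    log_prob = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_row = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count

    # Rows with no positive pair (an instance seen only once) are ignored.
    return per_row[pos_mask.any(dim=1)].mean()
```

In training, `latents` could be per-instance pooled outputs of the video diffusion backbone, but that wiring is likewise an assumption rather than the paper's stated design.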
Related papers
- Large-scale Pre-training for Grounded Video Caption Generation [74.23767687855279]
We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and densely temporally grounded bounding boxes.
arXiv Detail & Related papers (2025-03-13T18:21:07Z) - Unified Dense Prediction of Video Diffusion [91.16237431830417]
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize a colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation.
arXiv Detail & Related papers (2025-03-12T12:41:02Z) - GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning [20.210972863275924]
We introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM) which embeds multi-grained videos and texts into a unified, low-dimensional semantic space. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance.
arXiv Detail & Related papers (2024-12-10T17:50:53Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval [16.548016892117083]
Text-video retrieval methods have received increased attention over the past few years.
Data augmentation techniques were introduced to increase the performance on unseen test examples.
We propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples.
arXiv Detail & Related papers (2022-08-03T14:05:20Z) - Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)