Disentangling Content and Motion for Text-Based Neural Video
Manipulation
- URL: http://arxiv.org/abs/2211.02980v1
- Date: Sat, 5 Nov 2022 21:49:41 GMT
- Title: Disentangling Content and Motion for Text-Based Neural Video
Manipulation
- Authors: Levent Karacan, Tolga Kerimoğlu, İsmail İnan, Tolga Birdal, Erkut Erdem, Aykut Erdem
- Abstract summary: We introduce a new method called DiCoMoGAN for manipulating videos with natural language.
Our evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods.
- Score: 28.922000242744435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Giving machines the ability to imagine possible new objects or scenes from
linguistic descriptions and produce their realistic renderings is arguably one
of the most challenging problems in computer vision. Recent advances in deep
generative models have led to new approaches that give promising results
towards this goal. In this paper, we introduce a new method called DiCoMoGAN
for manipulating videos with natural language, aiming to perform local and
semantic edits on a video clip to alter the appearances of an object of
interest. Our GAN architecture allows for better utilization of multiple
observations by disentangling content and motion to enable controllable
semantic edits. To this end, we introduce two tightly coupled networks: (i) a
representation network for constructing a concise understanding of motion
dynamics and temporally invariant content, and (ii) a translation network that
exploits the extracted latent content representation to actuate the
manipulation according to the target description. Our qualitative and
quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms
existing frame-based methods, producing temporally coherent and semantically
more meaningful results.
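For illustration, below is a minimal PyTorch-style sketch of the two-network layout described in the abstract: a representation network that factors a clip into a temporally invariant content code plus per-frame motion codes, and a translation network that edits a frame conditioned on that content code and a text embedding. All module choices, dimensions, and the placeholder text embedding are assumptions for exposition, not the authors' DiCoMoGAN implementation; the adversarial losses and training procedure are omitted.

```python
# Illustrative sketch only (not the authors' code): two coupled networks in the
# spirit of DiCoMoGAN, a representation network that splits a clip into a
# temporally varying motion code and a temporally invariant content code, and a
# translation network that edits frames conditioned on content plus target text.
import torch
import torch.nn as nn


class RepresentationNet(nn.Module):
    """Encodes a clip into one shared content code and per-frame motion codes."""

    def __init__(self, feat_dim=256, content_dim=128, motion_dim=64):
        super().__init__()
        self.frame_enc = nn.Sequential(                 # per-frame CNN encoder
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.motion_rnn = nn.GRU(feat_dim, motion_dim, batch_first=True)
        self.to_content = nn.Linear(feat_dim, content_dim)

    def forward(self, video):                           # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_enc(video.flatten(0, 1)).view(b, t, -1)
        motion, _ = self.motion_rnn(feats)               # varies over time
        content = self.to_content(feats.mean(dim=1))     # shared across frames
        return content, motion


class TranslationNet(nn.Module):
    """Maps (frame, content code, text embedding) to an edited frame."""

    def __init__(self, content_dim=128, text_dim=256):
        super().__init__()
        self.cond = nn.Linear(content_dim + text_dim, 64)
        self.dec = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, frame, content, text_emb):         # frame: (B, 3, H, W)
        cond = self.cond(torch.cat([content, text_emb], dim=-1))
        cond_map = cond[:, :, None, None].expand(-1, -1, *frame.shape[-2:])
        return self.dec(torch.cat([frame, cond_map], dim=1))


if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 64, 64)                 # toy 8-frame clip
    text_emb = torch.randn(2, 256)                       # placeholder text embedding
    content, motion = RepresentationNet()(video)
    edited = TranslationNet()(video[:, 0], content, text_emb)
    print(edited.shape)                                  # torch.Size([2, 3, 64, 64])
```

In this sketch, applying the translation network to every frame with the same content code and text embedding is what would keep the edit temporally consistent, while the per-frame motion codes are left untouched by the manipulation.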
Related papers
- Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions.
Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z)
- DynVFX: Augmenting Real Videos with Dynamic Content [19.393567535259518]
We present a method for augmenting real-world videos with newly generated dynamic content.
Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects.
The position, appearance, and motion of the new content are seamlessly integrated into the original footage.
arXiv Detail & Related papers (2025-02-05T21:14:55Z)
- Dynamic Scene Understanding from Vision-Language Representations [11.833972582610027]
We propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen vision-language representations.
We achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches.
arXiv Detail & Related papers (2025-01-20T18:33:46Z)
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Context Propagation from Proposals for Semantic Video Object Segmentation [1.223779595809275]
We propose a novel approach to learning semantic contextual relationships in videos for semantic object segmentation.
Our proposals derive the semantic contexts from video objects, encoding the key evolution of objects and the relationships among objects over the semantic-temporal domain.
arXiv Detail & Related papers (2024-07-08T14:44:18Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates these complementary attributes, context and motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory (a rough sketch of these two ideas appears after this list).
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
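As a companion to the "Wide and Narrow" entry above, the following is a rough, hypothetical sketch of the two ideas named in its summary: non-local aggregation of past-frame features for global context, and adaptive per-location filter kernels predicted from a motion feature. It is based only on the one-sentence summary, not the paper's actual architecture; the names and dimensions are invented for exposition.

```python
# Hypothetical sketch of the two components named in the "Wide and Narrow"
# summary: (i) attention over past-frame features as global context, and
# (ii) per-pixel filter kernels predicted from a motion feature. Not the
# paper's architecture; written only to make the summary concrete.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContext(nn.Module):
    """Attends from the current frame to features of all past frames."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, cur, past):       # cur: (B, C, H, W); past: (B, T, C, H, W)
        b, t, c, h, w = past.shape
        q = self.q(cur).flatten(2).transpose(1, 2)                       # (B, HW, C)
        k = self.k(past.flatten(0, 1)).view(b, t, c, h * w).permute(0, 2, 1, 3).flatten(2)  # (B, C, T*HW)
        v = self.v(past.flatten(0, 1)).view(b, t, c, h * w).permute(0, 2, 1, 3).flatten(2)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)                   # (B, HW, T*HW)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)                 # (B, C, HW)
        return cur + out.view(b, c, h, w)


class LocalDynamicFilter(nn.Module):
    """Predicts a k x k kernel per pixel from a motion feature and applies it."""

    def __init__(self, dim=64, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(dim, k * k, 3, padding=1)

    def forward(self, feat, motion_feat):                # both (B, C, H, W)
        b, c, h, w = feat.shape
        kernels = torch.softmax(self.pred(motion_feat), dim=1)           # (B, k*k, H, W)
        patches = F.unfold(feat, self.k, padding=self.k // 2)            # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        out = (patches * kernels.view(b, 1, self.k * self.k, h * w)).sum(2)
        return out.view(b, c, h, w)


if __name__ == "__main__":
    cur, past = torch.randn(1, 64, 16, 16), torch.randn(1, 4, 64, 16, 16)
    ctx = GlobalContext()(cur, past)                      # context-enriched features
    out = LocalDynamicFilter()(ctx, torch.randn(1, 64, 16, 16))
    print(out.shape)                                      # torch.Size([1, 64, 16, 16])
```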
This list is automatically generated from the titles and abstracts of the papers on this site.