Disentangling Content and Motion for Text-Based Neural Video
Manipulation
- URL: http://arxiv.org/abs/2211.02980v1
- Date: Sat, 5 Nov 2022 21:49:41 GMT
- Title: Disentangling Content and Motion for Text-Based Neural Video
Manipulation
- Authors: Levent Karacan, Tolga Kerimoğlu, İsmail İnan, Tolga Birdal, Erkut Erdem, Aykut Erdem
- Abstract summary: We introduce a new method called DiCoMoGAN for manipulating videos with natural language.
Our evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods.
- Score: 28.922000242744435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Giving machines the ability to imagine possible new objects or scenes from
linguistic descriptions and produce their realistic renderings is arguably one
of the most challenging problems in computer vision. Recent advances in deep
generative models have led to new approaches that give promising results
towards this goal. In this paper, we introduce a new method called DiCoMoGAN
for manipulating videos with natural language, aiming to perform local and
semantic edits on a video clip to alter the appearances of an object of
interest. Our GAN architecture allows for better utilization of multiple
observations by disentangling content and motion to enable controllable
semantic edits. To this end, we introduce two tightly coupled networks: (i) a
representation network for constructing a concise understanding of motion
dynamics and temporally invariant content, and (ii) a translation network that
exploits the extracted latent content representation to actuate the
manipulation according to the target description. Our qualitative and
quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms
existing frame-based methods, producing temporally coherent and semantically
more meaningful results.
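For illustration, below is a minimal PyTorch-style sketch of the two-network layout described in the abstract: a representation network that factors a clip into a temporally invariant content code plus per-frame motion codes, and a translation network that edits a frame conditioned on that content code and a text embedding. All module choices, dimensions, and the placeholder text embedding are assumptions for exposition, not the authors' DiCoMoGAN implementation; the adversarial losses and training procedure are omitted.

```python
# Illustrative sketch only (not the authors' code): two coupled networks in the
# spirit of DiCoMoGAN, a representation network that splits a clip into a
# temporally varying motion code and a temporally invariant content code, and a
# translation network that edits frames conditioned on content plus target text.
import torch
import torch.nn as nn


class RepresentationNet(nn.Module):
    """Encodes a clip into one shared content code and per-frame motion codes."""

    def __init__(self, feat_dim=256, content_dim=128, motion_dim=64):
        super().__init__()
        self.frame_enc = nn.Sequential(                 # per-frame CNN encoder
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.motion_rnn = nn.GRU(feat_dim, motion_dim, batch_first=True)
        self.to_content = nn.Linear(feat_dim, content_dim)

    def forward(self, video):                           # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_enc(video.flatten(0, 1)).view(b, t, -1)
        motion, _ = self.motion_rnn(feats)               # varies over time
        content = self.to_content(feats.mean(dim=1))     # shared across frames
        return content, motion


class TranslationNet(nn.Module):
    """Maps (frame, content code, text embedding) to an edited frame."""

    def __init__(self, content_dim=128, text_dim=256):
        super().__init__()
        self.cond = nn.Linear(content_dim + text_dim, 64)
        self.dec = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, frame, content, text_emb):         # frame: (B, 3, H, W)
        cond = self.cond(torch.cat([content, text_emb], dim=-1))
        cond_map = cond[:, :, None, None].expand(-1, -1, *frame.shape[-2:])
        return self.dec(torch.cat([frame, cond_map], dim=1))


if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 64, 64)                 # toy 8-frame clip
    text_emb = torch.randn(2, 256)                       # placeholder text embedding
    content, motion = RepresentationNet()(video)
    edited = TranslationNet()(video[:, 0], content, text_emb)
    print(edited.shape)                                  # torch.Size([2, 3, 64, 64])
```

In this sketch, applying the translation network to every frame with the same content code and text embedding is what would keep the edit temporally consistent, while the per-frame motion codes are left untouched by the manipulation.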
Related papers
- Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions.
Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z)
- DynVFX: Augmenting Real Videos with Dynamic Content [19.393567535259518]
We present a method for augmenting real-world videos with newly generated dynamic content.
Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects.
The position, appearance, and motion of the new content are seamlessly integrated into the original footage.
arXiv Detail & Related papers (2025-02-05T21:14:55Z)
- Dynamic Scene Understanding from Vision-Language Representations [11.833972582610027]
We propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen vision-language representations.
We achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches.
arXiv Detail & Related papers (2025-01-20T18:33:46Z)
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Context Propagation from Proposals for Semantic Video Object Segmentation [1.223779595809275]
We propose a novel approach to learning semantic contextual relationships in videos for semantic object segmentation.
Our proposals derive the semantic contexts from video objects, encoding the key evolution of objects and the relationships among objects over the semantic-temporal domain.
arXiv Detail & Related papers (2024-07-08T14:44:18Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates these complementary attributes, context and motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory (a rough sketch of these two ideas appears after this list).
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
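As a companion to the "Wide and Narrow" entry above, the following is a rough, hypothetical sketch of the two ideas named in its summary: non-local aggregation of past-frame features for global context, and adaptive per-location filter kernels predicted from a motion feature. It is based only on the one-sentence summary, not the paper's actual architecture; the names and dimensions are invented for exposition.

```python
# Hypothetical sketch of the two components named in the "Wide and Narrow"
# summary: (i) attention over past-frame features as global context, and
# (ii) per-pixel filter kernels predicted from a motion feature. Not the
# paper's architecture; written only to make the summary concrete.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContext(nn.Module):
    """Attends from the current frame to features of all past frames."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, cur, past):       # cur: (B, C, H, W); past: (B, T, C, H, W)
        b, t, c, h, w = past.shape
        q = self.q(cur).flatten(2).transpose(1, 2)                       # (B, HW, C)
        k = self.k(past.flatten(0, 1)).view(b, t, c, h * w).permute(0, 2, 1, 3).flatten(2)  # (B, C, T*HW)
        v = self.v(past.flatten(0, 1)).view(b, t, c, h * w).permute(0, 2, 1, 3).flatten(2)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)                   # (B, HW, T*HW)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)                 # (B, C, HW)
        return cur + out.view(b, c, h, w)


class LocalDynamicFilter(nn.Module):
    """Predicts a k x k kernel per pixel from a motion feature and applies it."""

    def __init__(self, dim=64, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(dim, k * k, 3, padding=1)

    def forward(self, feat, motion_feat):                # both (B, C, H, W)
        b, c, h, w = feat.shape
        kernels = torch.softmax(self.pred(motion_feat), dim=1)           # (B, k*k, H, W)
        patches = F.unfold(feat, self.k, padding=self.k // 2)            # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        out = (patches * kernels.view(b, 1, self.k * self.k, h * w)).sum(2)
        return out.view(b, c, h, w)


if __name__ == "__main__":
    cur, past = torch.randn(1, 64, 16, 16), torch.randn(1, 4, 64, 16, 16)
    ctx = GlobalContext()(cur, past)                      # context-enriched features
    out = LocalDynamicFilter()(ctx, torch.randn(1, 64, 16, 16))
    print(out.shape)                                      # torch.Size([1, 64, 16, 16])
```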
This list is automatically generated from the titles and abstracts of the papers on this site.