Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
- URL: http://arxiv.org/abs/2412.02617v1
- Date: Tue, 03 Dec 2024 17:44:23 GMT
- Title: Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
- Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
- Abstract summary: We investigate the use of feedback to enhance the object dynamics in text-to-video models. We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
- Score: 130.090296560882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
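To make the feedback-driven finetuning concrete, the sketch below shows one common instantiation of a KL-regularized, reward-weighted objective with binary AI feedback. It is a minimal illustration only, not the authors' implementation: the `binary_ai_feedback` stub, the log-probability interface, and the `beta` coefficient are hypothetical placeholders, and for diffusion-based video models the per-video log-likelihood would typically be approximated by a negative denoising loss.

```python
# Minimal sketch (not the paper's code) of KL-regularized, reward-weighted
# finetuning of a text-to-video model using binary AI feedback from a VLM judge.
import torch


def binary_ai_feedback(video: torch.Tensor, prompt: str) -> float:
    """Hypothetical stub for a vision-language model judge.

    In practice this would ask the VLM a yes/no question such as
    "Does this video depict the described object interaction realistically?"
    and map "yes" -> 1.0, "no" -> 0.0.
    """
    return 1.0  # placeholder; replace with an actual VLM query


def finetune_loss(logp_policy: torch.Tensor,
                  logp_reference: torch.Tensor,
                  rewards: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """One offline objective from the KL-regularized family.

    logp_policy / logp_reference: (approximate) per-video log-likelihoods under
    the finetuned and frozen reference models; rewards: binary AI feedback in {0, 1}.
    """
    reward_term = -(rewards * logp_policy).mean()             # reward-weighted likelihood
    kl_term = (logp_policy - logp_reference.detach()).mean()  # log-ratio penalty toward the reference
    return reward_term + beta * kl_term


if __name__ == "__main__":
    # Toy usage with fake log-probabilities and feedback for a batch of 4 videos.
    logp_policy = torch.randn(4, requires_grad=True)
    logp_reference = torch.randn(4)
    prompts = ["ball drops", "cups collide", "dominoes fall", "glass tips over"]
    rewards = torch.tensor([binary_ai_feedback(torch.empty(0), p) for p in prompts])
    loss = finetune_loss(logp_policy, logp_reference, rewards)
    loss.backward()
    print(float(loss))
```

The binary reward here mirrors the abstract's finding that simple yes/no VLM feedback drives the largest quality gains; any of the other rewards mentioned above (e.g., CLIP scores or optical-flow statistics) could be substituted for the feedback stub.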
Related papers
- Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD).
Our key insight is to employ specialized layout representation for hands and objects, respectively.
To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z)
- Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.
In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process.
We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z)
- Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions.
Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction module is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z)
- Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)