Action-conditioned video data improves predictability
- URL: http://arxiv.org/abs/2404.05439v1
- Date: Mon, 8 Apr 2024 12:18:01 GMT
- Title: Action-conditioned video data improves predictability
- Authors: Meenakshi Sarkar, Debasish Ghose
- Abstract summary: The Action-Conditioned Video Generation (ACVG) framework generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments.
- Score: 1.9567015559455132
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where cameras are mounted on moving platforms. The interaction between observed image frames and the motion of the recording agent introduces additional complexities. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework's effectiveness on an indoor robot motion dataset which consists of sequences of image frames along with the sequences of actions taken by the robotic agent, conducting a comprehensive empirical study comparing ACVG to other state-of-the-art frameworks along with a detailed ablation study.
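The abstract describes the Generator-Actor architecture only at a high level. As a minimal illustrative sketch of the general idea of action-conditioned frame generation (not the authors' ACVG implementation), the PyTorch snippet below conditions a next-frame generator on a robot action vector; every module name, layer size, and the additive fusion scheme are assumptions.

```python
# Minimal, hypothetical sketch of action-conditioned next-frame prediction.
# Nothing here is taken from the ACVG paper; names, layer sizes, and the
# fusion scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class ActionConditionedGenerator(nn.Module):
    def __init__(self, action_dim: int = 2, hidden: int = 64):
        super().__init__()
        # Encode the current frame into a spatial feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Embed the robot action (e.g. linear and angular velocity).
        self.action_embed = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.ReLU(),
        )
        # Decode the fused features back to the predicted next frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(frame)          # (B, hidden, h, w)
        act = self.action_embed(action)     # (B, hidden)
        # Broadcast the action embedding over the spatial map; fuse by addition.
        feat = feat + act[:, :, None, None]
        return self.decoder(feat)

# Usage: roll the model forward frame by frame, feeding the action at each step.
model = ActionConditionedGenerator()
frame = torch.rand(1, 3, 64, 64)        # current observation
action = torch.tensor([[0.5, -0.1]])    # e.g. (linear, angular) velocity
next_frame = model(frame, action)       # (1, 3, 64, 64)
```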
Related papers
- Video Generation with Learned Action Prior [1.740992908651449]
Video generation is particularly challenging when the camera is mounted on a moving platform as camera motion interacts with image pixels.
Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics.
We propose a solution that treats the camera motion or action as part of the observed image state, modelling both the image state and the action within a multi-AP learning framework.
arXiv Detail & Related papers (2024-06-20T16:00:07Z)
- Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches (an illustrative flow-warping baseline is sketched after this entry).
arXiv Detail & Related papers (2024-04-21T05:09:56Z)
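To illustrate why motion estimation matters for frame interpolation, here is the classic flow-based midpoint baseline: a generic technique, not the MADiff method. The optical flow is assumed to come from any off-the-shelf estimator; a zero-flow placeholder is used below.

```python
# Generic flow-based frame interpolation baseline, for illustration only;
# this is a classic technique, not MADiff.
import torch
import torch.nn.functional as F

def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # Sample `frame` (B, C, H, W) at pixel locations displaced by `flow`
    # (B, 2, H, W), where flow channel 0 is x and channel 1 is y.
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(frame).expand(b, -1, -1, -1)
    coords = base + flow
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

def interpolate_midpoint(f0, f1, flow_0to1):
    # Under a linear-motion assumption, the midpoint frame samples frame 0 at
    # x - 0.5*flow and frame 1 at x + 0.5*flow (occlusions ignored).
    mid_from_0 = backward_warp(f0, -0.5 * flow_0to1)
    mid_from_1 = backward_warp(f1, 0.5 * flow_0to1)
    return 0.5 * (mid_from_0 + mid_from_1)

f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)          # placeholder flow; zero motion
mid = interpolate_midpoint(f0, f1, flow)  # equals 0.5*(f0+f1) here
```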
- RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation [21.387160107315797]
We introduce the novel task of interactive scene exploration, wherein robots autonomously explore environments and produce an action-conditioned scene graph (ACSG).
The ACSG accounts for both low-level information (geometry and semantics) and high-level information (action-conditioned relationships between different entities) in the scene; a toy sketch of such a structure follows this entry.
We present the Robotic Exploration (RoboEXP) system, which incorporates the Large Multimodal Model (LMM) and an explicit memory design to enhance our system's capabilities.
arXiv Detail & Related papers (2024-02-23T18:27:17Z)
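For intuition, an ACSG-like structure can be pictured as nodes carrying geometry and semantics plus edges recording relations that only become observable after an action. The dataclass sketch below is invented for illustration; its names and fields do not reflect the RoboEXP codebase.

```python
# Purely illustrative sketch of an action-conditioned scene graph; the class
# and field names are invented and do not reflect the RoboEXP implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str        # semantic label, e.g. "drawer"
    geometry: tuple  # low-level info, e.g. a bounding box

@dataclass
class Edge:
    src: str
    dst: str
    relation: str         # e.g. "inside"
    enabling_action: str  # action that reveals the relation, e.g. "open"

@dataclass
class ActionConditionedSceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def observe_after_action(self, action: str, src: str, dst: str, relation: str) -> None:
        # Record a relation that only became observable after taking `action`.
        self.edges.append(Edge(src, dst, relation, action))

# Usage: exploring "open(drawer)" reveals that a spoon is inside the drawer.
g = ActionConditionedSceneGraph()
g.add_node(Node("drawer", (0.1, 0.2, 0.4, 0.5)))
g.add_node(Node("spoon", (0.15, 0.25, 0.2, 0.3)))
g.observe_after_action("open", "spoon", "drawer", "inside")
```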
- Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection [67.90338302559672]
We propose to select actor-related scene context, rather than directly leveraging the raw video scenario, to improve relation modeling.
We develop a Cycle Actor-Context Relation network (CycleACR) with a symmetric graph that models actor and context relations bidirectionally.
Compared to existing designs that focus on C2A-E, our CycleACR introduces A2C-R for more effective relation modeling.
arXiv Detail & Related papers (2023-03-28T16:40:47Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- GATSBI: Generative Agent-centric Spatio-temporal Object Interaction [9.328991021103294]
GATSBI is a generative model that transforms a sequence of raw observations into a structured representation.
We show GATSBI achieves superior performance on scene decomposition and video prediction compared to its state-of-the-art counterparts.
arXiv Detail & Related papers (2021-04-09T09:45:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.