Batteries, camera, action! Learning a semantic control space for
expressive robot cinematography
- URL: http://arxiv.org/abs/2011.10118v2
- Date: Wed, 31 Mar 2021 21:15:21 GMT
- Title: Batteries, camera, action! Learning a semantic control space for
expressive robot cinematography
- Authors: Rogerio Bonatti, Arthur Bucker, Sebastian Scherer, Mustafa Mukadam and
Jessica Hodgins
- Abstract summary: We develop a data-driven framework that enables editing of complex camera positioning parameters in a semantic space.
First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator.
We use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip.
- Score: 15.895161373307378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aerial vehicles are revolutionizing the way film-makers can capture shots of
actors by composing novel aerial and dynamic viewpoints. However, despite great
advancements in autonomous flight technology, generating expressive camera
behaviors is still a challenge and requires non-technical users to edit a large
number of unintuitive control parameters. In this work, we develop a
data-driven framework that enables editing of these complex camera positioning
parameters in a semantic space (e.g. calm, enjoyable, establishing). First, we
generate a database of video clips with a diverse range of shots in a
photo-realistic simulator, and use hundreds of participants in a crowd-sourcing
framework to obtain scores for a set of semantic descriptors for each clip.
Next, we analyze correlations between descriptors and build a semantic control
space based on cinematography guidelines and human perception studies. Finally,
we learn a generative model that can map a set of desired semantic video
descriptors into low-level camera trajectory parameters. We evaluate our system
by demonstrating that our model successfully generates shots that are rated by
participants as having the expected degrees of expression for each descriptor.
We also show that our models generalize to different scenes in both simulation
and real-world experiments. Data and video found at:
https://sites.google.com/view/robotcam.
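The pipeline above ends with a generative model that maps desired semantic descriptor scores to low-level camera trajectory parameters. Below is a minimal sketch of such a mapping, assuming a plain feed-forward decoder; the descriptor axes, parameter names, and network shape are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: map semantic descriptor scores to low-level camera
# trajectory parameters. The descriptor axes, parameter names, and the
# plain MLP decoder are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

DESCRIPTORS = ["calm", "enjoyable", "establishing"]                 # semantic axes (assumed)
TRAJ_PARAMS = ["distance_m", "height_m", "tilt_deg", "speed_mps"]   # low-level knobs (assumed)

class SemanticToTrajectory(nn.Module):
    """Decoder from a semantic control vector to trajectory parameters."""
    def __init__(self, n_desc=len(DESCRIPTORS), n_params=len(TRAJ_PARAMS), hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_desc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, desc_scores):
        # desc_scores: (batch, n_desc), each score normalized to [0, 1]
        return self.net(desc_scores)

if __name__ == "__main__":
    model = SemanticToTrajectory()
    # Ask for a shot that is very calm, moderately enjoyable, strongly establishing.
    desired = torch.tensor([[0.9, 0.5, 0.8]])
    params = model(desired)                     # untrained here, so values are arbitrary
    print(dict(zip(TRAJ_PARAMS, params.squeeze(0).tolist())))
```

In practice such a decoder would be fit on pairs of crowd-sourced descriptor scores and the trajectory parameters used to render each clip, as described in the abstract.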
Related papers
- Towards Understanding Camera Motions in Any Video [80.223048294482]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding.
CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process.
One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
arXiv Detail & Related papers (2025-04-21T18:34:57Z)
- GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography [98.28272367169465]
We introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories.
Thanks to the comprehensive and diverse database, we train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation.
Experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability.
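GenDoP's summary describes an auto-regressive, decoder-only Transformer for camera movement. Below is a minimal sketch of that general idea, assuming camera poses are quantized into discrete tokens and decoded greedily; the vocabulary size, model dimensions, and sampling loop are assumptions for illustration, not GenDoP's architecture.

```python
# Minimal sketch of autoregressive camera-trajectory generation with a
# decoder-only Transformer. Pose tokenization, vocabulary size, and the
# greedy sampling loop are illustrative assumptions, not GenDoP itself.
import torch
import torch.nn as nn

VOCAB = 256      # quantized camera-pose tokens (assumed)
MAX_LEN = 64     # trajectory length in tokens (assumed)

class TrajectoryDecoder(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(MAX_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) pose-token ids; a causal mask enforces autoregression
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = torch.triu(torch.full((seq, seq), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def generate(model, start_token=0, steps=16):
    seq = torch.tensor([[start_token]])
    for _ in range(steps):
        next_tok = model(seq)[:, -1].argmax(-1, keepdim=True)  # greedy decoding
        seq = torch.cat([seq, next_tok], dim=1)
    return seq  # token ids to be de-quantized back into camera poses

print(generate(TrajectoryDecoder()).shape)  # torch.Size([1, 17])
```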
arXiv Detail & Related papers (2025-04-09T17:56:01Z)
- CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models [89.63787060844409]
CameraCtrl II is a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model.
We take an approach that progressively expands the generation of dynamic scenes.
arXiv Detail & Related papers (2025-03-13T17:42:01Z)
- ChatCam: Empowering Camera Control through Conversational AI [67.31920821192323]
ChatCam is a system that navigates camera movements through conversations with users.
To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation.
We also develop an Anchor Determinator to ensure precise camera trajectory placement.
arXiv Detail & Related papers (2024-09-25T20:13:41Z)
- Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
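The selection step described above combines a pairwise similarity matrix with diversity-based sampling. Below is a minimal sketch of one common realization, using a greedy farthest-point-style rule over a blended spatial/semantic similarity matrix; the equal weighting and the greedy criterion are assumptions, not necessarily the paper's algorithm.

```python
# Minimal sketch: greedy diversity-based camera selection over a similarity
# matrix that mixes spatial and semantic similarity. The 0.5/0.5 weighting
# and the greedy farthest-point rule are illustrative assumptions.
import numpy as np

def build_similarity(positions, features, w_spatial=0.5, w_semantic=0.5):
    """positions: (N, 3) camera centers; features: (N, D) image descriptors."""
    d_pos = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    spatial_sim = np.exp(-d_pos / (d_pos.mean() + 1e-8))           # close cameras are similar
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    semantic_sim = f @ f.T                                         # cosine similarity of images
    return w_spatial * spatial_sim + w_semantic * semantic_sim

def select_cameras(sim, k):
    """Greedily pick k cameras, each time taking the one least similar to the chosen set."""
    n = sim.shape[0]
    chosen = [int(np.argmin(sim.sum(axis=1)))]          # seed with the least redundant camera
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        scores = [sim[i, chosen].max() for i in rest]   # similarity to closest chosen camera
        chosen.append(rest[int(np.argmin(scores))])
    return chosen

rng = np.random.default_rng(0)
pos, feat = rng.normal(size=(20, 3)), rng.normal(size=(20, 8))
print(select_cameras(build_similarity(pos, feat), k=5))
```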
arXiv Detail & Related papers (2024-09-11T08:36:49Z)
- Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators [46.40277880351059]
We explore utilizing visual signals as a new interface for models to interact with the environment.
We find that the model exhibits an emergent zero-shot capability to infer the semantics of a demonstration video and imitate them in an unseen scenario.
Results show that our models can generate high-quality video clips that accurately align with the semantic guidance provided by the demonstration videos.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)
- Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy [3.713586225621126]
A robot must be able to identify semantically traversable terrain in an image based on a semantic understanding of the scene.
This reasoning relies on estimating semantic traversability, which is frequently achieved with semantic segmentation models fine-tuned on the test domain.
We present an effective methodology for training a semantic traversability estimator using egocentric videos and an automated annotation process.
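Deriving traversability from a segmentation model, as described above, can be illustrated with a small sketch: per-pixel class predictions are mapped to a binary traversable/non-traversable mask. The class names and the traversable set below are assumed for illustration and are not the paper's label definitions.

```python
# Minimal sketch: derive a binary traversability mask from semantic
# segmentation output. The class names and the traversable set are assumed
# for illustration; they are not the paper's label definitions.
import numpy as np

CLASS_NAMES = ["sidewalk", "grass", "road", "obstacle", "water"]   # assumed labels
TRAVERSABLE = {"sidewalk", "grass", "road"}                        # assumed policy

def traversability_mask(class_logits):
    """class_logits: (H, W, num_classes) scores from any segmentation model."""
    labels = class_logits.argmax(axis=-1)                          # per-pixel class id
    traversable_ids = [i for i, n in enumerate(CLASS_NAMES) if n in TRAVERSABLE]
    return np.isin(labels, traversable_ids)                        # (H, W) boolean mask

logits = np.random.rand(4, 4, len(CLASS_NAMES))
print(traversability_mask(logits).astype(int))
```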
arXiv Detail & Related papers (2024-06-05T06:40:04Z)
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
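One common way to turn predicted point tracks into a motion command, in the spirit of the summary above, is to fit a rigid transform between current and predicted point locations and let a residual policy refine it. The 2D Kabsch-style fit and the zero-residual placeholder below are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: fit a rigid 2D transform (rotation + translation) that maps
# current tracked points to their predicted future locations, then add a
# residual correction. The Kabsch-style fit and the zero-residual placeholder
# are illustrative assumptions, not the paper's exact pipeline.
import numpy as np

def fit_rigid_2d(src, dst):
    """Least-squares rotation R and translation t with dst ≈ src @ R.T + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:                       # keep a proper rotation (no reflection)
        vt[-1] *= -1
        r = vt.T @ u.T
    t = dst.mean(0) - src.mean(0) @ r.T
    return r, t

def action_from_tracks(points_now, points_future, residual=np.zeros(3)):
    r, t = fit_rigid_2d(points_now, points_future)
    angle = np.arctan2(r[1, 0], r[0, 0])             # planar rotation component
    return np.array([t[0], t[1], angle]) + residual  # a residual policy would refine this

pts = np.random.rand(10, 2)
theta = 0.1
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(action_from_tracks(pts, pts @ rot.T + np.array([0.05, 0.0])))
```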
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- PathFinder: Attention-Driven Dynamic Non-Line-of-Sight Tracking with a Mobile Robot [3.387892563308912]
We introduce a novel approach to process a sequence of dynamic successive frames in a line-of-sight (LOS) video using an attention-based neural network.
We validate the approach on in-the-wild scenes using a drone for video capture, thus demonstrating low-cost NLOS imaging in dynamic capture environments.
arXiv Detail & Related papers (2024-04-07T17:31:53Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
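The lift and splat steps can be illustrated with a toy example: per-pixel features are placed at discrete depths along each camera ray and then scatter-added into a bird's-eye-view grid. The single pinhole camera, uniform depth bins, and tiny grid below are simplifying assumptions, not the paper's full architecture.

```python
# Minimal sketch of "lift" (unproject pixels to 3D points over depth bins)
# and "splat" (scatter-add features into a BEV grid). A single pinhole
# camera, uniform depth bins, and a tiny grid are simplifying assumptions.
import numpy as np

H, W, C = 8, 8, 4                   # image size and feature channels (toy values)
DEPTHS = np.linspace(1.0, 10.0, 5)  # discrete depth bins along each ray
BEV, CELL = 16, 1.0                 # BEV grid is BEV x BEV cells of CELL meters

def lift(features, intrinsics):
    """Return 3D points (H*W*D, 3) and their features (H*W*D, C)."""
    fx, fy, cx, cy = intrinsics
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u, float)], -1)  # (H, W, 3)
    pts = rays[None] * DEPTHS[:, None, None, None]          # (D, H, W, 3)
    feats = np.broadcast_to(features, (len(DEPTHS), H, W, C))
    return pts.reshape(-1, 3), feats.reshape(-1, C)

def splat(points, feats):
    """Scatter-add features into a BEV grid indexed by (x, z) ground coordinates."""
    bev = np.zeros((BEV, BEV, C))
    ix = np.clip((points[:, 0] / CELL + BEV // 2).astype(int), 0, BEV - 1)
    iz = np.clip((points[:, 2] / CELL).astype(int), 0, BEV - 1)
    np.add.at(bev, (iz, ix), feats)
    return bev

img_feats = np.random.rand(H, W, C)
bev_map = splat(*lift(img_feats, intrinsics=(10.0, 10.0, W / 2, H / 2)))
print(bev_map.shape)  # (16, 16, 4)
```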
arXiv Detail & Related papers (2020-08-13T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.