Batteries, camera, action! Learning a semantic control space for
expressive robot cinematography
- URL: http://arxiv.org/abs/2011.10118v2
- Date: Wed, 31 Mar 2021 21:15:21 GMT
- Title: Batteries, camera, action! Learning a semantic control space for
expressive robot cinematography
- Authors: Rogerio Bonatti, Arthur Bucker, Sebastian Scherer, Mustafa Mukadam and
Jessica Hodgins
- Abstract summary: We develop a data-driven framework that enables editing of complex camera positioning parameters in a semantic space.
First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator.
We use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip.
- Score: 15.895161373307378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aerial vehicles are revolutionizing the way film-makers can capture shots of
actors by composing novel aerial and dynamic viewpoints. However, despite great
advancements in autonomous flight technology, generating expressive camera
behaviors is still a challenge and requires non-technical users to edit a large
number of unintuitive control parameters. In this work, we develop a
data-driven framework that enables editing of these complex camera positioning
parameters in a semantic space (e.g. calm, enjoyable, establishing). First, we
generate a database of video clips with a diverse range of shots in a
photo-realistic simulator, and use hundreds of participants in a crowd-sourcing
framework to obtain scores for a set of semantic descriptors for each clip.
Next, we analyze correlations between descriptors and build a semantic control
space based on cinematography guidelines and human perception studies. Finally,
we learn a generative model that can map a set of desired semantic video
descriptors into low-level camera trajectory parameters. We evaluate our system
by demonstrating that our model successfully generates shots that are rated by
participants as having the expected degrees of expression for each descriptor.
We also show that our models generalize to different scenes in both simulation
and real-world experiments. Data and videos are available at:
https://sites.google.com/view/robotcam.
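The abstract's core idea, mapping a semantic space (calm, enjoyable, establishing) to low-level camera trajectory parameters, can be illustrated with a minimal sketch. This is not the authors' model; the descriptor names, trajectory parameters, and training data below are all illustrative stand-ins, and a simple least-squares regressor stands in for the paper's learned generative model.

```python
# Hypothetical sketch (not the paper's code): fit a linear map from
# crowd-sourced semantic descriptor scores to camera trajectory parameters.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative semantic axes and low-level control parameters.
DESCRIPTORS = ["calm", "enjoyable", "establishing"]
TRAJ_PARAMS = ["speed", "height", "orbit_radius", "tilt"]

# Stand-in training data: per-clip descriptor scores (X) paired with the
# trajectory parameters that generated each clip (Y).
X = rng.uniform(0.0, 1.0, size=(200, len(DESCRIPTORS)))
true_W = rng.normal(size=(len(DESCRIPTORS), len(TRAJ_PARAMS)))
Y = X @ true_W + 0.01 * rng.normal(size=(200, len(TRAJ_PARAMS)))

# Least-squares fit: semantic space -> control space.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def semantic_to_trajectory(scores):
    """Map desired descriptor scores to low-level trajectory parameters."""
    return np.asarray(scores) @ W

params = semantic_to_trajectory([0.9, 0.2, 0.7])  # very calm, establishing
print(dict(zip(TRAJ_PARAMS, params.round(2))))
```

In the paper the mapping is learned from human perception studies and is generative rather than a single linear regression; the sketch only shows the interface a semantic control space exposes to a non-technical user.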
Related papers
- ChatCam: Empowering Camera Control through Conversational AI [67.31920821192323]
ChatCam is a system that navigates camera movements through conversations with users.
To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation.
We also develop an Anchor Determinator to ensure precise camera trajectory placement.
arXiv Detail & Related papers (2024-09-25T20:13:41Z)
- Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
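The selection procedure described above (a similarity matrix over cameras plus diversity-based sampling) can be sketched with a simple greedy strategy. The features, matrix construction, and greedy rule below are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative sketch: greedy diversity-based camera selection from a
# pairwise similarity matrix (assumed details, not the paper's method).
import numpy as np

rng = np.random.default_rng(1)

# Stand-in per-camera features combining spatial pose and image semantics.
features = rng.normal(size=(20, 8))
features /= np.linalg.norm(features, axis=1, keepdims=True)
similarity = features @ features.T  # cosine similarity matrix

def select_diverse_cameras(sim, k):
    """Greedily pick k cameras, each minimizing max similarity to picks."""
    n = sim.shape[0]
    selected = [0]  # seed with an arbitrary first camera
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Choose the camera least similar to everything already selected.
        best = min(remaining, key=lambda i: sim[i, selected].max())
        selected.append(best)
    return selected

picked = select_diverse_cameras(similarity, k=5)
print(picked)
```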
arXiv Detail & Related papers (2024-09-11T08:36:49Z)
- Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)
- Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy [3.713586225621126]
A robot must be able to identify semantically traversable terrain in an image based on a semantic understanding of the scene.
This reasoning ability, termed semantic traversability, is frequently achieved using semantic segmentation models fine-tuned on the test domain.
We present an effective methodology for training a semantic traversability estimator using egocentric videos and an automated annotation process.
arXiv Detail & Related papers (2024-06-05T06:40:04Z)
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts how points in an image should move in future time steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- PathFinder: Attention-Driven Dynamic Non-Line-of-Sight Tracking with a Mobile Robot [3.387892563308912]
We introduce a novel approach that processes a sequence of successive dynamic frames from a line-of-sight (LOS) video using an attention-based neural network.
We validate the approach on in-the-wild scenes using a drone for video capture, thus demonstrating low-cost NLOS imaging in dynamic capture environments.
arXiv Detail & Related papers (2024-04-07T17:31:53Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by
Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
arXiv Detail & Related papers (2020-08-13T06:29:01Z)
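The "splat" step described above (pooling per-camera frustum features into a bird's-eye-view grid) can be sketched in a few lines. This is a simplified illustration under assumptions: the features are treated as already "lifted" to 3D points, whereas the actual method predicts a depth distribution per pixel; point counts, grid extent, and resolution are arbitrary.

```python
# Minimal sketch of BEV "splatting": scatter-add lifted 3D point features
# into a bird's-eye-view grid (illustrative, not the paper's implementation).
import numpy as np

rng = np.random.default_rng(2)

# Assumed "lifted" output: N 3D points (x, y, z) with a C-dim feature each.
points = rng.uniform(-10, 10, size=(500, 3))
feats = rng.normal(size=(500, 4))

# BEV grid: 20 m x 20 m at 1 m resolution, sum-pooling features from all
# frustum points that fall into each ground-plane cell.
res, extent = 1.0, 10.0
H = W = int(2 * extent / res)
bev = np.zeros((H, W, feats.shape[1]))
ix = ((points[:, 0] + extent) / res).astype(int).clip(0, W - 1)
iy = ((points[:, 1] + extent) / res).astype(int).clip(0, H - 1)
np.add.at(bev, (iy, ix), feats)  # unbuffered scatter-add: the "splat"
print(bev.shape)
```

The resulting grid is the cost-map-like representation into which template trajectories can then be "shot" for planning; the sum pooling with `np.add.at` mirrors the cumulative-sum pooling trick the paper uses for efficiency.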
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.