E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness
- URL: http://arxiv.org/abs/2407.01516v1
- Date: Mon, 1 Jul 2024 17:58:02 GMT
- Title: E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness
- Authors: Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton,
- Abstract summary: We propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions.
To our knowledge, this is the first dataset of its kind.
To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR.
We train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics.
- Score: 9.79206550593288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.
Related papers
- ChatCam: Empowering Camera Control through Conversational AI [67.31920821192323]
ChatCam is a system that navigates camera movements through conversations with users.
To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation.
We also develop an Anchor Determinator to ensure precise camera trajectory placement.
arXiv Detail & Related papers (2024-09-25T20:13:41Z) - NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z) - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval which aim to search for sequences that are most relevant to a natural motion description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs the specified-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - TVPR: Text-to-Video Person Retrieval and a New Benchmark [19.554989977778312]
We propose a new task called Text-to-Video Person Retrieval(TVPR)
TVPRN acquires video representations by fusing visual and motion representations of person videos.
TVPRN has achieved state-of-the-art performance on TVPReid dataset.
arXiv Detail & Related papers (2023-07-14T06:34:00Z) - Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z) - Cross-Camera Trajectories Help Person Retrieval in a Camera Network [124.65912458467643]
Existing methods often rely on purely visual matching or consider temporal constraints but ignore the spatial information of the camera network.
We propose a pedestrian retrieval framework based on cross-camera generation, which integrates both temporal and spatial information.
To verify the effectiveness of our method, we construct the first cross-camera pedestrian trajectory dataset.
arXiv Detail & Related papers (2022-04-27T13:10:48Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations
in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z) - Batteries, camera, action! Learning a semantic control space for
expressive robot cinematography [15.895161373307378]
We develop a data-driven framework that enables editing of complex camera positioning parameters in a semantic space.
First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator.
We use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip.
arXiv Detail & Related papers (2020-11-19T21:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.