Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows
- URL: http://arxiv.org/abs/2210.08737v1
- Date: Mon, 17 Oct 2022 04:11:23 GMT
- Title: Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows
- Authors: Anyi Rao, Xuekun Jiang, Sichen Wang, Yuwei Guo, Zihao Liu, Bo Dai,
Long Pang, Xiaoyu Wu, Dahua Lin, Libiao Jin
- Abstract summary: We first collect a novel benchmark for this setting with four diverse scenarios: concerts, sports games, gala shows, and contests.
It contains 88 hours of raw video that contribute to 14 hours of edited videos.
We propose a new approach, the temporal and contextual transformer, that utilizes clues from historical shots and other views to make shot transition decisions.
- Score: 83.54243912535667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to choose an appropriate camera view among multiple cameras plays
a vital role in the delivery of TV shows. However, the lack of high-quality
training data makes it hard to identify the statistical patterns of editing and
apply intelligent processing. To solve this issue, we first collect a novel
benchmark for this setting with four diverse scenarios, including concerts,
sports games, gala shows, and contests, where each scenario contains 6
synchronized tracks recorded by different cameras. The benchmark contains 88
hours of raw video that contribute to 14 hours of edited videos. Based on this
benchmark, we further propose a new approach, the temporal and contextual
transformer, that utilizes clues from historical shots and other views to make
shot transition decisions and predict which view to use. Extensive experiments
show that our method outperforms existing methods on the proposed multi-camera
editing benchmark.
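The authors' implementation is not part of this digest. Purely as an illustration, here is a minimal PyTorch sketch of how a view selector that attends over historical shots (temporal clues) and the other synchronized views (contextual clues) might be wired up; the class name, feature dimensions, and layer choices are assumptions, not the paper's architecture.
```python
# Hypothetical sketch only; not the authors' released model.
import torch
import torch.nn as nn

class TemporalContextualSketch(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Temporal branch: self-attention over features of historical shots.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)
        # Contextual branch: each candidate view attends to the encoded history.
        self.cross = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)  # one logit per candidate view

    def forward(self, history, views):
        # history: (B, T, D) features of T past shots
        # views:   (B, V, D) features of V synchronized camera views
        h = self.temporal(history)           # encode the shot history
        ctx, _ = self.cross(views, h, h)     # views attend to that history
        return self.score(ctx).squeeze(-1)   # (B, V) view-selection logits

model = TemporalContextualSketch()
history = torch.randn(2, 10, 512)   # batch of 2, 10 historical shots
views = torch.randn(2, 6, 512)      # 6 synchronized tracks, as in the benchmark
next_view = model(history, views).argmax(dim=-1)  # index of the predicted view
```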
Related papers
- Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation [8.21260979799828]
We propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets.
By training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain.
arXiv Detail & Related papers (2024-10-17T14:21:22Z)
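The summary above does not spell out how regular videos become pseudo-labeled multi-camera data; one plausible reading, sketched below under that assumption, is that each shot the editor actually kept becomes the positive "selected view" while other shots from the same video serve as distractor candidates. The function and field names are hypothetical.
```python
# Guessed pseudo-labeling scheme; not the paper's confirmed pipeline.
import random

def make_pseudo_samples(shots, num_candidates=6):
    """shots: list of per-shot clips (or features) from one edited video."""
    samples = []
    for i, shot in enumerate(shots):
        # Other shots from the same video act as distractor "camera views".
        pool = shots[:i] + shots[i + 1:]
        distractors = random.sample(pool, k=min(num_candidates - 1, len(pool)))
        # Insert the truly shown shot at a random position, record its index.
        label = random.randrange(len(distractors) + 1)
        candidates = distractors[:label] + [shot] + distractors[label:]
        samples.append({"candidates": candidates, "label": label})
    return samples
```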
- Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
arXiv Detail & Related papers (2024-09-11T08:36:49Z)
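As a rough sketch of the kind of diversity-based sampling the summary above describes (the exact algorithm and the weighting between spatial and semantic similarity are not given, so both are assumptions here):
```python
# Illustrative greedy diversity sampling over a camera-similarity matrix.
import numpy as np

def greedy_diverse_select(similarity, k):
    """similarity: (N, N) symmetric matrix mixing spatial and semantic
    similarity between cameras; returns indices of k diverse cameras."""
    n = similarity.shape[0]
    # Seed with the camera least similar to all others on average.
    selected = [int(np.argmin(similarity.sum(axis=1)))]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Add the camera whose worst-case similarity to the chosen set is lowest.
        selected.append(min(remaining, key=lambda i: similarity[i, selected].max()))
    return selected

rng = np.random.default_rng(0)
S = rng.random((10, 10)); S = (S + S.T) / 2   # toy similarity matrix
print(greedy_diverse_select(S, k=4))          # four well-spread cameras
```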
- EventAid: Benchmarking Event-aided Image/Video Enhancement Algorithms with Real-captured Hybrid Dataset [55.12137324648253]
Event cameras are an emerging imaging technology that offers advantages over conventional frame-based imaging sensors in dynamic range and sensing speed.
This paper focuses on five event-aided image and video enhancement tasks.
arXiv Detail & Related papers (2023-12-13T15:42:04Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
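The summary above only names a random-mask scheme; a minimal sketch of the two frame-visibility masks such a model might use (training-time random masking vs. inference-time transition generation) follows, with the keep probability and helper names being assumptions:
```python
# Hypothetical frame-visibility masks; details are assumptions, not SEINE's code.
import torch

def random_training_mask(num_frames, keep_prob=0.5):
    # True = frame is visible and conditions the diffusion model;
    # False = frame is masked and must be generated.
    return torch.rand(num_frames) < keep_prob

def transition_inference_mask(num_frames):
    # For transition generation, only the two endpoint shots are given
    # and every frame in between is synthesized.
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[0] = mask[-1] = True
    return mask
```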
- Replay: Multi-modal Multi-view Acted Videos for Casual Holography [76.49914880351167]
Replay is a collection of multi-view, multi-modal videos of humans interacting socially.
Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames.
The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
arXiv Detail & Related papers (2023-07-22T12:24:07Z)
- EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction [16.432164340779266]
Event cameras offer advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur.
Their output is not easily understandable by humans, making reconstruction of intensity images from event streams a fundamental task in event-based vision.
Recent deep learning-based methods have shown promise in video reconstruction from events, but this problem is not completely solved yet.
arXiv Detail & Related papers (2023-04-30T09:28:38Z)
- Cross-View Cross-Scene Multi-View Crowd Counting [56.83882084112913]
Multi-view crowd counting has previously been proposed to utilize multiple cameras to extend the field-of-view of a single camera.
We propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.
arXiv Detail & Related papers (2022-05-03T15:03:44Z)
- Learning to Deblur and Generate High Frame Rate Video with an Event Camera [0.0]
Event cameras do not suffer from motion blur when recording high-speed scenes.
We formulate the deblurring task for traditional cameras, guided by events, as a residual learning problem.
We propose corresponding network architectures for effective learning of deblurring and high frame rate video generation tasks.
arXiv Detail & Related papers (2020-03-02T13:02:05Z)
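The summary above states only that the formulation is residual learning; a minimal PyTorch sketch of that idea, with all architecture details (event binning, layer sizes) assumed rather than taken from the paper, might look like:
```python
# Illustrative residual formulation; not the paper's network.
import torch
import torch.nn as nn

class ResidualDeblurSketch(nn.Module):
    def __init__(self, event_bins=5, channels=32):
        super().__init__()
        # Small CNN mapping (blurry frame + stacked event slices) to a residual.
        self.net = nn.Sequential(
            nn.Conv2d(3 + event_bins, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, blurry, events):
        # blurry: (B, 3, H, W) frame; events: (B, event_bins, H, W) event stack.
        residual = self.net(torch.cat([blurry, events], dim=1))
        return blurry + residual  # sharp estimate = input + learned correction
```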
This list is automatically generated from the titles and abstracts of the papers on this site.