Video Exploration via Video-Specific Autoencoders
- URL: http://arxiv.org/abs/2103.17261v1
- Date: Wed, 31 Mar 2021 17:56:13 GMT
- Title: Video Exploration via Video-Specific Autoencoders
- Authors: Kevin Wang and Deva Ramanan and Aayush Bansal
- Abstract summary: We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
- Score: 60.256055890647595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present simple video-specific autoencoders that enable human-controllable
video exploration. This includes a wide variety of analytic tasks such as (but
not limited to) spatial and temporal super-resolution, spatial and temporal
editing, object removal, video textures, average video exploration, and
correspondence estimation within and across videos. Prior work has
independently looked at each of these problems and proposed different
formulations. In this work, we observe that a simple autoencoder trained (from
scratch) on multiple frames of a specific video enables one to perform a large
variety of video processing and editing tasks. Our tasks are enabled by two key
observations: (1) latent codes learned by the autoencoder capture spatial and
temporal properties of that video and (2) autoencoders can project
out-of-sample inputs onto the video-specific manifold. For example, (1)
interpolating latent codes enables temporal super-resolution and
user-controllable video textures; (2) manifold reprojection enables spatial
super-resolution, object removal, and denoising without training for any of the
tasks. Importantly, a two-dimensional visualization of latent codes via
principal component analysis acts as a tool for users to both visualize and
intuitively control video edits. Finally, we quantitatively contrast our
approach with the prior art and find that, without any supervision or
task-specific knowledge, our approach can perform comparably to supervised
approaches specifically trained for a task.
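The abstract describes three concrete operations: training a per-video autoencoder from scratch, interpolating latent codes, and reprojecting out-of-sample frames onto the learned manifold, plus a 2-D PCA view of the latent codes. The following is a minimal PyTorch sketch of those ideas under our own assumptions; the architecture, the fixed 64x64 frame size, the training schedule, and all function names are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch (assumed architecture, not the paper's code) of a
# video-specific autoencoder and the operations named in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        # Encoder: frame -> low-dimensional latent code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Decoder: latent code -> reconstructed frame (fixed at 64x64 here).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_on_video(frames, epochs=200, lr=1e-3):
    """Fit the autoencoder from scratch on frames of one specific video.
    `frames` is a (num_frames, 3, 64, 64) tensor with values in [0, 1]."""
    model = VideoAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = model(frames)
        loss = F.l1_loss(recon, frames)  # plain reconstruction objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def temporal_superresolve(model, frame_a, frame_b, steps=4):
    """Observation (1): interpolating latent codes yields in-between frames."""
    za = model.encoder(frame_a.unsqueeze(0))
    zb = model.encoder(frame_b.unsqueeze(0))
    alphas = torch.linspace(0, 1, steps + 2)[1:-1]
    return [model.decoder((1 - a) * za + a * zb) for a in alphas]

@torch.no_grad()
def reproject(model, degraded_frame):
    """Observation (2): encoding and decoding an out-of-sample input (noisy,
    low-resolution, or edited frame) projects it onto the video manifold."""
    recon, _ = model(degraded_frame.unsqueeze(0))
    return recon

@torch.no_grad()
def latent_pca(model, frames):
    """2-D PCA of per-frame latent codes, the kind of plot the paper uses
    as a tool for visualizing and controlling edits."""
    z = model.encoder(frames)                 # (num_frames, latent_dim)
    _, _, v = torch.pca_lowrank(z, q=2)       # top-2 principal directions
    return (z - z.mean(dim=0, keepdim=True)) @ v   # (num_frames, 2)
```

Under these assumptions, `temporal_superresolve` would synthesize frames between two neighboring inputs, `reproject` would denoise or clean up a degraded or edited frame by snapping it back to the video-specific manifold, and `latent_pca` would produce the 2-D latent view described above; the paper's actual system exposes that view as an interactive editing interface.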
Related papers
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes for videos only from visual or textual information in a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z) - MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z) - Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder where the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been dedicated to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Human-Machine Collaborative Video Coding Through Cuboidal Partitioning [26.70051123157869]
We propose a video coding framework that leverages the commonality between human vision and machine vision applications using cuboids.
Cuboids, estimated rectangular regions over a video frame, are computationally efficient, have a compact representation, and are object-centric.
Here, cuboidal feature descriptors are extracted from the current frame and then employed for a machine vision task in the form of object detection.
arXiv Detail & Related papers (2021-02-02T04:44:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.