Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
- URL: http://arxiv.org/abs/2106.05392v1
- Date: Wed, 9 Jun 2021 21:16:05 GMT
- Title: Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
- Authors: Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques
- Abstract summary: We present a new drop-in block for video transformers that aggregates information along implicitly determined motion paths.
We also propose a new method to address the quadratic dependence of computation and memory on the input size.
We obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets.
- Score: 77.52828273633646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In video transformers, the time dimension is often treated in the same way as
the two spatial dimensions. However, in a scene where objects or the camera may
move, a physical point imaged at one location in frame $t$ may be entirely
unrelated to what is found at that location in frame $t+k$. These temporal
correspondences should be modeled to facilitate learning about dynamic scenes.
To this end, we propose a new drop-in block for video transformers --
trajectory attention -- that aggregates information along implicitly determined
motion paths. We additionally propose a new method to address the quadratic
dependence of computation and memory on the input size, which is particularly
important for high resolution or long videos. While these ideas are useful in a
range of settings, we apply them to the specific task of video action
recognition with a transformer model and obtain state-of-the-art results on the
Kinetics, Something-Something V2, and Epic-Kitchens datasets. Code and models
are available at: https://github.com/facebookresearch/Motionformer
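To make the abstract's description concrete, below is a minimal, single-head PyTorch sketch of trajectory-style attention: per-frame spatial attention forms one pooled "trajectory token" per frame for every query location, and a second attention step pools along that implied trajectory over time. The module name, projection layout, and tensor shapes are illustrative assumptions, not the official implementation; see https://github.com/facebookresearch/Motionformer for the authors' code, which also contains the approximation addressing the quadratic cost mentioned above.
```python
# Minimal single-head sketch of trajectory-style attention, written from the
# abstract's description. Shapes, layer names, and the absence of multi-head
# logic are simplifications, not the official Motionformer code.
import torch
import torch.nn as nn


class TrajectoryAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)         # joint Q/K/V projection
        self.temporal_q = nn.Linear(dim, dim)      # re-projection for temporal pooling
        self.temporal_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, T, S, D) -- batch, frames, spatial patches, channels
        B, T, S, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # 1) For every query location (s, t), attend spatially within each
        #    frame t', producing one pooled "trajectory token" per frame.
        q_ = q.reshape(B, T * S, D)
        attn = torch.einsum('bqd,btsd->bqts', q_, k) * self.scale
        attn = attn.softmax(dim=-1)                        # softmax over patches in each frame
        traj = torch.einsum('bqts,btsd->bqtd', attn, v)    # (B, T*S, T, D)

        # 2) Attend along the implied trajectory: each query pools its T
        #    trajectory tokens into a single output token.
        tq = self.temporal_q(q_)                           # (B, T*S, D)
        tk, tv = self.temporal_kv(traj).chunk(2, dim=-1)   # each (B, T*S, T, D)
        t_attn = torch.einsum('bqd,bqtd->bqt', tq, tk) * self.scale
        t_attn = t_attn.softmax(dim=-1)
        out = torch.einsum('bqt,bqtd->bqd', t_attn, tv)

        # Note: this exact form is quadratic in the number of space-time
        # tokens (T*S); the paper's companion approximation (not sketched
        # here) targets that cost.
        return self.proj(out).reshape(B, T, S, D)
```
As a quick shape check, TrajectoryAttention(dim=64) applied to a random tensor of shape (2, 8, 196, 64) returns a tensor of the same (batch, frames, patches, channels) shape.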
Related papers
- Controllable Longer Image Animation with Diffusion Models [12.565739255499594]
We introduce an open-domain controllable image animation method using motion priors with video diffusion models.
Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos.
We propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks.
arXiv Detail & Related papers (2024-05-27T16:08:00Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
- Motion Transformer for Unsupervised Image Animation [37.35527776043379]
Image animation aims to animate a source image by using motion learned from a driving video.
Current state-of-the-art methods typically use convolutional neural networks (CNNs) to predict motion information.
We propose a new method, the motion transformer, which is the first attempt to build a motion estimator based on a vision transformer.
arXiv Detail & Related papers (2022-09-28T12:04:58Z)
- Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z)
- Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z)
- NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)
- VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something V2 (SSV2) datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.