Streaming Video Model
- URL: http://arxiv.org/abs/2303.17228v1
- Date: Thu, 30 Mar 2023 08:51:49 GMT
- Title: Streaming Video Model
- Authors: Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella,
Zheng-Jun Zha
- Abstract summary: We propose to unify video understanding tasks into one streaming video architecture, referred to as Streaming Vision Transformer (S-ViT).
S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve frame-based video tasks.
The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy in sequence-based action recognition.
- Score: 90.24390609039335
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video understanding tasks have traditionally been modeled by two separate
architectures, specially tailored for two distinct tasks. Sequence-based video
tasks, such as action recognition, use a video backbone to directly extract
spatiotemporal features, while frame-based video tasks, such as multiple object
tracking (MOT), rely on a single fixed-image backbone to extract spatial
features. In contrast, we propose to unify video understanding tasks into one
novel streaming video architecture, referred to as Streaming Vision Transformer
(S-ViT). S-ViT first produces frame-level features with a memory-enabled
temporally-aware spatial encoder to serve the frame-based video tasks. Then the
frame features are input into a task-related temporal decoder to obtain
spatiotemporal features for sequence-based tasks. The efficiency and efficacy
of S-ViT are demonstrated by the state-of-the-art accuracy in the sequence-based
action recognition task and the competitive advantage over conventional
architectures in the frame-based MOT task. We believe that the concept of
streaming video model and the implementation of S-ViT are solid steps towards a
unified deep learning architecture for video understanding. Code will be
available at https://github.com/yuzhms/Streaming-Video-Model.
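The abstract describes a two-stage streaming design: a memory-enabled, temporally-aware spatial encoder that emits per-frame features, followed by a task-related temporal decoder that aggregates them for sequence-level tasks. The sketch below illustrates that idea in PyTorch; the class names, dimensions, bounded token memory, and pooling scheme are illustrative assumptions, not the authors' S-ViT implementation (see the linked repository for the official code).

```python
# Minimal sketch of a streaming video model: per-frame encoding with a bounded
# memory of past frame tokens, plus a temporal decoder for sequence-level tasks.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn


class TemporallyAwareSpatialEncoder(nn.Module):
    """Encodes one frame at a time, cross-attending to a memory of past frames."""

    def __init__(self, dim=256, num_heads=8, memory_size=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory_size = memory_size
        self.memory = []  # list of (batch, num_patches, dim) tensors from past frames

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_patches, dim) patch tokens of the current frame
        x = self.spatial(frame_tokens)
        if self.memory:
            mem = torch.cat(self.memory, dim=1)            # (batch, T*num_patches, dim)
            attn_out, _ = self.temporal_attn(x, mem, mem)  # attend to past-frame tokens
            x = x + attn_out
        # Streaming update: keep only the most recent frames in memory.
        self.memory.append(x.detach())
        self.memory = self.memory[-self.memory_size:]
        return x  # frame-level features, usable directly for frame-based tasks


class TemporalDecoder(nn.Module):
    """Task-related decoder that turns per-frame features into a clip-level output."""

    def __init__(self, dim=256, num_heads=8, num_classes=400):
        super().__init__()
        self.decoder = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, dim)
        x = self.decoder(frame_features)
        return self.head(x.mean(dim=1))  # sequence-level prediction


if __name__ == "__main__":
    encoder = TemporallyAwareSpatialEncoder()
    decoder = TemporalDecoder()
    pooled = []
    for _ in range(8):                         # stream 8 frames, one at a time
        tokens = torch.randn(1, 196, 256)      # e.g. 14x14 patch tokens per frame
        feats = encoder(tokens)                # frame-level features
        pooled.append(feats.mean(dim=1))       # pool patches into one token per frame
    logits = decoder(torch.stack(pooled, dim=1))
    print(logits.shape)                        # torch.Size([1, 400])
```

Because the encoder consumes one frame at a time and keeps only a bounded memory, its outputs can be used immediately for frame-based tasks such as MOT, while the decoder reuses the same per-frame features for sequence-based tasks such as action recognition.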
Related papers
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation [4.019144083959918]
We present TANGO, a framework for generating co-speech body-gesture videos.
Given a few-minute, single-speaker reference video, TANGO produces high-fidelity videos with synchronized body gestures.
arXiv Detail & Related papers (2024-10-05T16:30:46Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - Multi-entity Video Transformers for Fine-Grained Video Representation
Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z) - Task Agnostic Restoration of Natural Video Dynamics [10.078712109708592]
In many video restoration/translation tasks, image processing operations are naïvely extended to the video domain by processing each frame independently.
We propose a general framework for this task that learns to infer and utilize consistent motion dynamics from inconsistent videos to mitigate the temporal flicker.
The proposed framework produces SOTA results on two benchmark datasets, DAVIS and videvo.net, processed by numerous image processing applications.
arXiv Detail & Related papers (2022-06-08T09:00:31Z) - Exploiting long-term temporal dynamics for video captioning [40.15826846670479]
We propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences.
Experimental results obtained in two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-22T11:40:09Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame".
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z) - Dual Temporal Memory Network for Efficient Video Object Segmentation [42.05305410986511]
One of the fundamental challenges in Video Object Segmentation (VOS) is how to make the best use of temporal information to boost performance.
We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories.
Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network.
arXiv Detail & Related papers (2020-03-13T06:07:45Z) - Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level.
We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)