Spatio-temporal Prompting Network for Robust Video Feature Extraction
- URL: http://arxiv.org/abs/2402.02574v1
- Date: Sun, 4 Feb 2024 17:52:04 GMT
- Title: Spatio-temporal Prompting Network for Robust Video Feature Extraction
- Authors: Guanxiong Sun, Chi Wang, Zhaoyu Zhang, Jiankang Deng, Stefanos
Zafeiriou, Yang Hua
- Abstract summary: Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by dynamically adjusting the input features in the backbone network.
- Score: 74.54597668310707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frame quality deterioration is one of the main challenges in the field of
video understanding. To compensate for the information loss caused by
deteriorated frames, recent approaches exploit transformer-based integration
modules to obtain spatio-temporal information. However, these integration
modules are heavy and complex. Furthermore, each integration module is
specifically tailored for its target task, making it difficult to generalise to
multiple tasks. In this paper, we present a neat and unified framework, called
Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and
accurate video features by dynamically adjusting the input features in the
backbone network. Specifically, STPN predicts several video prompts containing
spatio-temporal information of neighbour frames. Then, these video prompts are
prepended to the patch embeddings of the current frame as the updated input for
video feature extraction. Moreover, STPN is easy to generalise to various video
tasks because it does not contain task-specific modules. Without bells and
whistles, STPN achieves state-of-the-art performance on three widely-used
datasets for different video understanding tasks, i.e., ImageNetVID for video
object detection, YouTubeVIS for video instance segmentation, and GOT-10k for
visual object tracking. Code is available at
https://github.com/guanxiongsun/vfe.pytorch.
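The mechanism described above (predicting video prompts from neighbour frames and prepending them to the current frame's patch embeddings) can be pictured with a minimal PyTorch sketch. This is not the authors' implementation (see the repository linked above); the pooling-based prompt predictor, the number of prompt tokens, and the tensor shapes are illustrative assumptions.

```python
# Minimal sketch of the prompting idea from the abstract: predict video prompts
# from neighbour-frame features and prepend them to the current frame's patch
# embeddings before the backbone. Module design and shapes are assumptions.
import torch
import torch.nn as nn


class SpatioTemporalPrompting(nn.Module):
    def __init__(self, embed_dim=768, num_prompts=8):
        super().__init__()
        self.num_prompts = num_prompts
        # Hypothetical prompt predictor: pools neighbour-frame patch embeddings
        # and maps them to a small set of prompt tokens.
        self.prompt_predictor = nn.Linear(embed_dim, num_prompts * embed_dim)

    def forward(self, curr_patches, neighbour_patches):
        # curr_patches:      (B, N, C) patch embeddings of the current frame
        # neighbour_patches: (B, T, N, C) patch embeddings of neighbour frames
        B, _, _, C = neighbour_patches.shape
        pooled = neighbour_patches.mean(dim=(1, 2))        # (B, C)
        prompts = self.prompt_predictor(pooled)            # (B, P*C)
        prompts = prompts.view(B, self.num_prompts, C)     # (B, P, C)
        # Prepend the predicted video prompts; the concatenated sequence is the
        # updated input fed to the backbone network for feature extraction.
        return torch.cat([prompts, curr_patches], dim=1)   # (B, P+N, C)


if __name__ == "__main__":
    stp = SpatioTemporalPrompting()
    curr = torch.randn(2, 196, 768)        # current frame, 14x14 patches
    neigh = torch.randn(2, 4, 196, 768)    # four neighbour frames
    print(stp(curr, neigh).shape)          # torch.Size([2, 204, 768])
```

Because the prompts only modify the backbone's input sequence, the same backbone can serve detection, segmentation, and tracking without task-specific integration modules, which is the generalisation argument made in the abstract.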
Related papers
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to the vast data and processing demands.
We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries.
It features a Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
- Streaming Video Model [90.24390609039335]
We propose to unify video understanding tasks into one streaming video architecture, referred to as the Streaming Vision Transformer (S-ViT).
S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve frame-based video tasks.
The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy in sequence-based action recognition.
arXiv Detail & Related papers (2023-03-30T08:51:49Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Temporal Complementary Learning for Video Person Re-Identification [110.43147302200101]
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification.
A saliency erasing operation drives the specific learner to mine new and complementary parts by erasing the parts activated by previous frames (a rough sketch of this step appears after this list).
A Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient feature.
arXiv Detail & Related papers (2020-07-18T07:59:01Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information through.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
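For the Temporal Complementary Learning entry above, the saliency-erasing step (suppressing spatial parts already activated by previous frames so the next learner mines complementary ones) can be sketched roughly as follows. This is an illustration, not the authors' code; the erase ratio, the per-position saliency input, and the tensor shapes are assumptions.

```python
# Rough illustration of a saliency-erasing step: zero out the spatial positions
# most strongly activated by previous frames so that a subsequent learner is
# pushed towards new, complementary parts. All shapes and the erase ratio are
# assumptions, not taken from the paper.
import torch


def saliency_erase(curr_feat, prev_saliency, erase_ratio=0.3):
    # curr_feat:     (B, C, H, W) feature map of the current frame
    # prev_saliency: (B, H, W) accumulated activation map from previous frames
    B, C, H, W = curr_feat.shape
    flat = prev_saliency.view(B, -1)             # (B, H*W)
    k = max(1, int(erase_ratio * H * W))
    _, idx = flat.topk(k, dim=1)                 # most activated positions so far
    mask = torch.ones_like(flat)
    mask.scatter_(1, idx, 0.0)                   # erase those positions
    return curr_feat * mask.view(B, 1, H, W)     # complementary-part feature map


if __name__ == "__main__":
    feat = torch.randn(2, 256, 16, 8)
    prev = torch.rand(2, 16, 8)
    print(saliency_erase(feat, prev).shape)      # torch.Size([2, 256, 16, 8])
```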
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.