LocFormer: Enabling Transformers to Perform Temporal Moment Localization
on Long Untrimmed Videos With a Feature Sampling Approach
- URL: http://arxiv.org/abs/2112.10066v1
- Date: Sun, 19 Dec 2021 05:32:14 GMT
- Title: LocFormer: Enabling Transformers to Perform Temporal Moment Localization
on Long Untrimmed Videos With a Feature Sampling Approach
- Authors: Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando,
Hiroya Takamura, Qi Wu
- Abstract summary: LocFormer is a Transformer-based model for video grounding that operates at a constant memory footprint regardless of the video length.
We propose a modular design that separates functionality, enabling us to learn an inductive bias via supervising the self-attention heads.
- Score: 35.93734845932161
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose LocFormer, a Transformer-based model for video grounding which
operates at a constant memory footprint regardless of the video length, i.e.
number of frames. LocFormer is designed for tasks where it is necessary to
process the entire long video and at its core lie two main contributions.
First, our model incorporates a new sampling technique that splits the input
feature sequence into a fixed number of sections and selects a single feature
per section using a stochastic approach, which allows us to obtain a feature
sample set that is representative of the video content for the task at hand
while keeping the memory footprint constant. Second, we propose a modular
design that separates functionality, enabling us to learn an inductive bias via
supervising the self-attention heads, while also effectively leveraging
pre-trained text and video encoders. We test our proposals on relevant
benchmark datasets for video grounding, showing not only that LocFormer
achieves excellent results, including state-of-the-art performance on
YouCookII, but also that our sampling technique is more effective than
competing counterparts and consistently improves the performance of prior
work, by up to 3.13% in mean temporal IoU, ultimately leading to a new
state-of-the-art performance on Charades-STA.
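To make the first contribution more concrete, below is a minimal sketch (not the authors' implementation) of a fixed-budget stochastic sampler: the frame-feature sequence is split into a fixed number of contiguous sections and one feature is kept per section, drawn uniformly at random during training and at the section midpoint otherwise. The function names, the midpoint fallback at inference time, and the uniform within-section distribution are assumptions of this sketch; a temporal IoU helper is included only because the abstract reports gains in mean temporal IoU. The key property is that the output size depends only on the section budget, not on the video length.

```python
import torch


def temporal_iou(pred, gt):
    """IoU between two (start, end) temporal segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def sample_fixed_budget(features: torch.Tensor, num_sections: int,
                        training: bool = True) -> torch.Tensor:
    """Pick one feature per section from a (T, D) frame-feature sequence.

    The sequence is split into `num_sections` contiguous sections; one index
    is drawn uniformly at random inside each section during training, and the
    section midpoint is taken at inference, so the output is always
    (num_sections, D) regardless of how many frames the video has.
    """
    T = features.shape[0]
    edges = torch.linspace(0, T, num_sections + 1)
    picks = []
    for i in range(num_sections):
        lo = int(edges[i])
        hi = max(lo, int(edges[i + 1]) - 1)  # guard against empty sections
        if training:
            idx = int(torch.randint(lo, hi + 1, (1,)))  # stochastic pick
        else:
            idx = (lo + hi) // 2                        # deterministic pick
        picks.append(idx)
    return features[picks]


# Example: 1500 frame features of dimension 512 reduced to a fixed budget of 128.
feats = torch.randn(1500, 512)
print(sample_fixed_budget(feats, num_sections=128).shape)  # torch.Size([128, 512])
print(temporal_iou((12.0, 30.0), (15.0, 28.0)))            # ~0.722
```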
Related papers
- Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and a large language model (LLM).
TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z)
- View while Moving: Efficient Video Recognition in Long-untrimmed Videos [17.560160747282147]
We propose a novel recognition paradigm "View while Moving" for efficient long-untrimmed video recognition.
In contrast to the two-stage paradigm, our paradigm needs to access the raw frames only once.
Our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency.
arXiv Detail & Related papers (2023-08-09T09:46:26Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z)
- Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA).
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)