No frame left behind: Full Video Action Recognition
- URL: http://arxiv.org/abs/2103.15395v1
- Date: Mon, 29 Mar 2021 07:44:28 GMT
- Title: No frame left behind: Full Video Action Recognition
- Authors: Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, Jan C. van Gemert
- Abstract summary: We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
- Score: 26.37329995193377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Not all video frames are equally informative for recognizing an action. It is
computationally infeasible to train deep networks on all video frames when
actions develop over hundreds of frames. A common heuristic is uniformly
sampling a small number of video frames and using these to recognize the
action. Instead, here we propose full video action recognition and consider all
video frames. To make this computationally tractable, we first cluster all frame
activations along the temporal dimension based on their similarity with respect
to the classification task, and then temporally aggregate the frames in the
clusters into a smaller number of representations. Our method is end-to-end
trainable and computationally efficient as it relies on temporally localized
clustering in combination with fast Hamming distances in feature space. We
evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where
we compare favorably to existing heuristic frame sampling methods.
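To make the clustering-and-aggregation idea concrete, here is a minimal PyTorch sketch: it greedily groups temporally adjacent frames whose binarized activations lie within a Hamming-distance threshold, then averages each group into a single representation. The greedy adjacency rule, the threshold, and all tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): temporally localized clustering of
# per-frame activations via Hamming distance on binarized features, followed
# by per-cluster averaging. Threshold and shapes are assumptions.
import torch

def binarize(feats: torch.Tensor) -> torch.Tensor:
    """Sign-binarize frame activations so similarity reduces to a Hamming distance."""
    return (feats > 0).to(torch.uint8)

def temporal_cluster(feats: torch.Tensor, max_hamming: int) -> list[torch.Tensor]:
    """Greedily merge temporally adjacent frames whose binary codes are close.

    feats: (T, D) per-frame activations. Returns index tensors, one per
    temporally contiguous cluster.
    """
    codes = binarize(feats)
    clusters, start = [], 0
    for t in range(1, feats.shape[0]):
        # Hamming distance between consecutive binarized frame activations.
        if (codes[t] != codes[t - 1]).sum().item() > max_hamming:
            clusters.append(torch.arange(start, t))
            start = t
    clusters.append(torch.arange(start, feats.shape[0]))
    return clusters

def aggregate(feats: torch.Tensor, clusters: list[torch.Tensor]) -> torch.Tensor:
    """Average the frames inside each cluster into a single representation."""
    return torch.stack([feats[idx].mean(dim=0) for idx in clusters])

if __name__ == "__main__":
    frames = torch.randn(300, 512)  # e.g. 300 frames, 512-d activations (assumed)
    reps = aggregate(frames, temporal_cluster(frames, max_hamming=128))
    print(reps.shape)               # (num_clusters, 512) -- far fewer than 300 rows
```

Because all frames contribute to some cluster, no frame is discarded; the downstream classifier only sees the much smaller set of aggregated representations.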
Related papers
- Explorative Inbetweening of Time and Space [46.77750028273578]
We introduce bounded generation to control video generation based only on a given start and end frame.
Time Reversal Fusion fuses the temporally forward and backward denoising paths conditioned on the start and end frame.
We find that Time Reversal Fusion outperforms related work on all subtasks.
arXiv Detail & Related papers (2024-03-21T17:57:31Z)
- Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring [76.54162653678871]
We propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation.
Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Video Instance Segmentation using Inter-Frame Communication Transformers [28.539742250704695]
Recently, per-clip pipelines have shown superior performance over per-frame methods.
Previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications.
We propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames.
arXiv Detail & Related papers (2021-06-07T02:08:39Z)
- An Image is Worth 16x16 Words, What is a Video Worth? [14.056790511123866]
Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames.
Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video.
We address the computational bottleneck by significantly reducing the number of frames required for inference.
arXiv Detail & Related papers (2021-03-25T15:25:17Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
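For contrast with the clustering sketch above, here is the uniform-sampling heuristic that the abstract, and several of the clip-based methods listed here, refer to: pick a handful of evenly spaced frames and discard the rest. The frame count and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumed shapes/counts): the uniform frame-sampling heuristic
# that the main paper argues against -- keep k evenly spaced frames, drop the rest.
import torch

def uniform_sample(video: torch.Tensor, k: int = 8) -> torch.Tensor:
    """video: (T, C, H, W). Returns k evenly spaced frames."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, steps=k).round().long()
    return video[idx]

clip = uniform_sample(torch.randn(300, 3, 224, 224))  # (8, 3, 224, 224)
```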
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.