Video BagNet: short temporal receptive fields increase robustness in
long-term action recognition
- URL: http://arxiv.org/abs/2308.11249v1
- Date: Tue, 22 Aug 2023 07:44:59 GMT
- Title: Video BagNet: short temporal receptive fields increase robustness in
long-term action recognition
- Authors: Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert
- Abstract summary: A large temporal receptive field allows the model to encode the exact sub-action order of a video.
We investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field.
We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order.
- Score: 11.452704540879513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work on long-term video action recognition relies on deep
3D-convolutional models that have a large temporal receptive field (RF). We
argue that these models are not always the best choice for temporal modeling in
videos. A large temporal receptive field allows the model to encode the exact
sub-action order of a video, which causes a performance decrease when test
videos have a different sub-action order. In this work, we investigate whether
we can improve the model robustness to the sub-action order by shrinking the
temporal receptive field of action recognition models. For this, we design
Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive
field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on
synthetic and real-world video datasets and experimentally compare models with
varying temporal receptive fields. We find that short receptive fields are
robust to sub-action order changes, while larger temporal receptive fields are
sensitive to the sub-action order.
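The abstract does not spell out the exact layer configuration, so the following PyTorch sketch only illustrates the general idea under an assumption: as in the spatial BagNet, the temporal receptive field is kept small by giving most 3D-convolutional blocks a temporal kernel size of 1, so that only a few blocks aggregate over neighbouring frames. The layer counts, channel widths and the `Bottleneck3D` / `temporal_receptive_field` helpers below are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn


def temporal_receptive_field(temporal_kernels, temporal_strides):
    """Temporal receptive field of a stack of (non-dilated) convolutions.

    The RF grows by (k - 1) * jump per layer, where jump is the product of
    the temporal strides of all preceding layers.
    """
    rf, jump = 1, 1
    for k, s in zip(temporal_kernels, temporal_strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


class Bottleneck3D(nn.Module):
    """ResNet-50-style bottleneck with a configurable temporal kernel size.

    t_kernel=1 keeps the block purely spatial (no temporal RF growth);
    t_kernel=3 lets it aggregate information over 3 neighbouring frames.
    """

    def __init__(self, in_ch, mid_ch, out_ch, t_kernel=1):
        super().__init__()
        pad = (t_kernel // 2, 1, 1)
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=(t_kernel, 3, 3),
                      padding=pad, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.skip = (nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.skip(x))


if __name__ == "__main__":
    # Hypothetical configuration: only the first 4 of 16 bottleneck blocks use a
    # temporal kernel of size 3, so the temporal RF stays small regardless of depth.
    t_kernels = [3] * 4 + [1] * 12
    t_strides = [1] * 16                      # no temporal downsampling
    print(temporal_receptive_field(t_kernels, t_strides))   # -> 9 frames

    blocks = [Bottleneck3D(64, 64, 256, t_kernel=t_kernels[0])]
    blocks += [Bottleneck3D(256, 64, 256, t_kernel=k) for k in t_kernels[1:]]
    net = nn.Sequential(*blocks)
    clip = torch.randn(1, 64, 33, 28, 28)     # (batch, channels, frames, H, W)
    print(net(clip).shape)                    # torch.Size([1, 256, 33, 28, 28])
```
Under this assumed scheme, four temporal-kernel-3 blocks with no temporal striding give a receptive field of 1 + 4·(3−1) = 9 frames, and zero, eight or sixteen such blocks would give the 1, 17 and 33 frame variants mentioned in the abstract; the actual Video BagNet design may place these layers differently.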
Related papers
- Balancing long- and short-term dynamics for the modeling of saliency in videos [14.527351636175615]
We present a Transformer-based approach to learn a joint representation of video frames and past saliency information.
Our model embeds long- and short-term information to detect dynamically shifting saliency in video.
arXiv Detail & Related papers (2025-04-08T11:09:37Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length (a minimal sketch of this sampling scheme appears after this list).
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z) - Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Streaming Video Temporal Action Segmentation In Real Time [2.8728707559692475]
We propose a real-time, end-to-end, multi-modality model for the streaming-video temporal action segmentation task.
Our model segments human actions in real time using less than 40% of the computation of the state-of-the-art model, while reaching 90% of the accuracy of the full-video state-of-the-art model.
arXiv Detail & Related papers (2022-09-28T03:27:37Z) - STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution
Video Prediction [78.129039340528]
We propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction.
Experimental results show that STRPM generates more satisfactory results than various existing methods.
arXiv Detail & Related papers (2022-03-30T06:24:00Z) - Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which uses separate encoders to represent different views of the input video.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z) - Insights from Generative Modeling for Neural Video Compression [31.59496634465347]
We present newly proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling.
We propose several architectures that yield state-of-the-art video compression performance on high-resolution video.
We provide further evidence that the generative modeling viewpoint can advance the neural video coding field.
arXiv Detail & Related papers (2021-07-28T02:19:39Z) - DSANet: Dynamic Segment Aggregation Network for Video-Level
Representation Learning [29.182482776910152]
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition.
In this paper, we introduce Dynamic Segment Aggregation (DSA) module to capture relationship among snippets.
Our proposed DSA module is shown to benefit various video recognition models significantly.
arXiv Detail & Related papers (2021-05-25T17:09:57Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves over state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy, runs 2 times faster at inference, and requires less than 5 MB of model storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
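As a side note on the Kernel Temporal Segmentation entry above: the uniform fixed-length clip sampling it contrasts against is a common baseline for long-video processing, and the short sketch below shows one minimal way to implement it. The function name, clip length and clip count are illustrative choices, not taken from any of the listed papers.
```python
import torch


def sample_uniform_clips(video, clip_len=32, num_clips=8):
    """Cut a long video into `num_clips` fixed-length clips whose start frames
    are spaced uniformly over the full duration.

    video: tensor of shape (T, C, H, W).
    returns: tensor of shape (num_clips, clip_len, C, H, W).
    """
    num_frames = video.shape[0]
    # Uniformly spaced start indices, clamped so every clip fits in the video.
    starts = torch.linspace(0, max(num_frames - clip_len, 0), num_clips).long()
    return torch.stack([video[int(s):int(s) + clip_len] for s in starts])


if __name__ == "__main__":
    long_video = torch.randn(900, 3, 224, 224)    # roughly 30 s at 30 fps
    clips = sample_uniform_clips(long_video)
    print(clips.shape)                            # torch.Size([8, 32, 3, 224, 224])
```
Because the clip boundaries ignore the video's actual segment structure, some clips end up redundant or uninformative, which is exactly the limitation that entry addresses with adaptive segmentation.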
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.