Video BagNet: short temporal receptive fields increase robustness in
long-term action recognition
- URL: http://arxiv.org/abs/2308.11249v1
- Date: Tue, 22 Aug 2023 07:44:59 GMT
- Title: Video BagNet: short temporal receptive fields increase robustness in
long-term action recognition
- Authors: Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert
- Abstract summary: A large temporal receptive field allows the model to encode the exact sub-action order of a video.
We investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field.
We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order.
- Score: 11.452704540879513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work on long-term video action recognition relies on deep
3D-convolutional models that have a large temporal receptive field (RF). We
argue that these models are not always the best choice for temporal modeling in
videos. A large temporal receptive field allows the model to encode the exact
sub-action order of a video, which causes a performance decrease when test
videos have a different sub-action order. In this work, we investigate whether
we can improve the model robustness to the sub-action order by shrinking the
temporal receptive field of action recognition models. For this, we design
Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive
field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on
synthetic and real-world video datasets and experimentally compare models with
varying temporal receptive fields. We find that short receptive fields are
robust to sub-action order changes, while larger temporal receptive fields are
sensitive to the sub-action order.
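The abstract does not spell out the exact layer configuration, so the following PyTorch sketch only illustrates the general idea under an assumption: as in the spatial BagNet, the temporal receptive field is kept small by giving most 3D-convolutional blocks a temporal kernel size of 1, so that only a few blocks aggregate over neighbouring frames. The layer counts, channel widths and the `Bottleneck3D` / `temporal_receptive_field` helpers below are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn


def temporal_receptive_field(temporal_kernels, temporal_strides):
    """Temporal receptive field of a stack of (non-dilated) convolutions.

    The RF grows by (k - 1) * jump per layer, where jump is the product of
    the temporal strides of all preceding layers.
    """
    rf, jump = 1, 1
    for k, s in zip(temporal_kernels, temporal_strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


class Bottleneck3D(nn.Module):
    """ResNet-50-style bottleneck with a configurable temporal kernel size.

    t_kernel=1 keeps the block purely spatial (no temporal RF growth);
    t_kernel=3 lets it aggregate information over 3 neighbouring frames.
    """

    def __init__(self, in_ch, mid_ch, out_ch, t_kernel=1):
        super().__init__()
        pad = (t_kernel // 2, 1, 1)
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=(t_kernel, 3, 3),
                      padding=pad, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.skip = (nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.skip(x))


if __name__ == "__main__":
    # Hypothetical configuration: only the first 4 of 16 bottleneck blocks use a
    # temporal kernel of size 3, so the temporal RF stays small regardless of depth.
    t_kernels = [3] * 4 + [1] * 12
    t_strides = [1] * 16                      # no temporal downsampling
    print(temporal_receptive_field(t_kernels, t_strides))   # -> 9 frames

    blocks = [Bottleneck3D(64, 64, 256, t_kernel=t_kernels[0])]
    blocks += [Bottleneck3D(256, 64, 256, t_kernel=k) for k in t_kernels[1:]]
    net = nn.Sequential(*blocks)
    clip = torch.randn(1, 64, 33, 28, 28)     # (batch, channels, frames, H, W)
    print(net(clip).shape)                    # torch.Size([1, 256, 33, 28, 28])
```
Under this assumed scheme, four temporal-kernel-3 blocks with no temporal striding give a receptive field of 1 + 4·(3−1) = 9 frames, and zero, eight or sixteen such blocks would give the 1, 17 and 33 frame variants mentioned in the abstract; the actual Video BagNet design may place these layers differently.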
Related papers
- Balancing long- and short-term dynamics for the modeling of saliency in videos [14.527351636175615]
We present a Transformer-based approach to learn a joint representation of video frames and past saliency information.
Our model embeds long- and short-term information to detect dynamically shifting saliency in video.
arXiv Detail & Related papers (2025-04-08T11:09:37Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length (a minimal sketch of this sampling scheme appears after this list).
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z) - Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Streaming Video Temporal Action Segmentation In Real Time [2.8728707559692475]
We propose a real-time, end-to-end, multi-modality model for the streaming-video temporal action segmentation task.
Our model segments human actions in real time using less than 40% of the computation of the state-of-the-art model, while reaching 90% of the accuracy of the full-video state-of-the-art model.
arXiv Detail & Related papers (2022-09-28T03:27:37Z) - STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution
Video Prediction [78.129039340528]
We propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction.
Experimental results show that STRPM generates more satisfactory results than various existing methods.
arXiv Detail & Related papers (2022-03-30T06:24:00Z) - Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which uses separate encoders to represent different views of the input video.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z) - Insights from Generative Modeling for Neural Video Compression [31.59496634465347]
We present newly proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling.
We propose several architectures that yield state-of-the-art video compression performance on high-resolution video.
We provide further evidence that the generative modeling viewpoint can advance the neural video coding field.
arXiv Detail & Related papers (2021-07-28T02:19:39Z) - DSANet: Dynamic Segment Aggregation Network for Video-Level
Representation Learning [29.182482776910152]
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition.
In this paper, we introduce Dynamic Segment Aggregation (DSA) module to capture relationship among snippets.
Our proposed DSA module is shown to benefit various video recognition models significantly.
arXiv Detail & Related papers (2021-05-25T17:09:57Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves over state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy, runs 2 times faster at inference, and requires less than 5 MB of model storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
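As a side note on the Kernel Temporal Segmentation entry above: the uniform fixed-length clip sampling it contrasts against is a common baseline for long-video processing, and the short sketch below shows one minimal way to implement it. The function name, clip length and clip count are illustrative choices, not taken from any of the listed papers.
```python
import torch


def sample_uniform_clips(video, clip_len=32, num_clips=8):
    """Cut a long video into `num_clips` fixed-length clips whose start frames
    are spaced uniformly over the full duration.

    video: tensor of shape (T, C, H, W).
    returns: tensor of shape (num_clips, clip_len, C, H, W).
    """
    num_frames = video.shape[0]
    # Uniformly spaced start indices, clamped so every clip fits in the video.
    starts = torch.linspace(0, max(num_frames - clip_len, 0), num_clips).long()
    return torch.stack([video[int(s):int(s) + clip_len] for s in starts])


if __name__ == "__main__":
    long_video = torch.randn(900, 3, 224, 224)    # roughly 30 s at 30 fps
    clips = sample_uniform_clips(long_video)
    print(clips.shape)                            # torch.Size([8, 32, 3, 224, 224])
```
Because the clip boundaries ignore the video's actual segment structure, some clips end up redundant or uninformative, which is exactly the limitation that entry addresses with adaptive segmentation.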
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.