TALLFormer: Temporal Action Localization with Long-memory Transformer
- URL: http://arxiv.org/abs/2204.01680v1
- Date: Mon, 4 Apr 2022 17:51:20 GMT
- Title: TALLFormer: Temporal Action Localization with Long-memory Transformer
- Authors: Feng Cheng, Gedas Bertasius
- Abstract summary: TALLFormer is a memory-efficient and end-to-end trainable temporal action localization transformer.
Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration.
With only RGB frames as input, TALLFormer outperforms previous state-of-the-art methods by a large margin.
- Score: 16.208160001820044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most modern approaches in temporal action localization divide this problem
into two parts: (i) short-term feature extraction and (ii) long-range temporal
boundary localization. Due to the high GPU memory cost caused by processing
long untrimmed videos, many methods sacrifice the representational power of the
short-term feature extractor by either freezing the backbone or using a very
small spatial video resolution. This issue becomes even worse with the recent
video transformer models, many of which have quadratic memory complexity. To
address these issues, we propose TALLFormer, a memory-efficient and end-to-end
trainable Temporal Action Localization transformer with Long-term memory. Our
long-term memory mechanism eliminates the need for processing hundreds of
redundant video frames during each training iteration, thus, significantly
reducing the GPU memory consumption and training time. These efficiency savings
allow us (i) to use a powerful video transformer-based feature extractor
without freezing the backbone or reducing the spatial video resolution, while
(ii) also maintaining long-range temporal boundary localization capability.
With only RGB frames as input and no external action recognition classifier,
TALLFormer outperforms previous state-of-the-art methods by a large margin,
achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The
code will be available at https://github.com/klauscc/TALLFormer.
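Based only on the mechanism described in the abstract, the following is a minimal sketch of such a long-term feature memory; it is not the released implementation, and every name (LongMemoryEncoder, sampled_idx, the assumed backbone interface mapping (N, C, T, H, W) clips to (N, feat_dim) features) is hypothetical.

```python
# Minimal sketch of a long-term feature memory, assuming a backbone that maps
# (N, C, T, H, W) clips to (N, feat_dim) features. Only a few sampled clips are
# re-encoded each training step; the rest are read from a cached memory, so GPU
# memory scales with the number of sampled clips, not with the video length.
import torch
import torch.nn as nn


class LongMemoryEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, num_clips: int, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # trainable short-term feature extractor (e.g. a video transformer)
        # one cached feature vector per clip of the untrimmed video
        self.register_buffer("memory", torch.zeros(num_clips, feat_dim))

    def forward(self, clips: torch.Tensor, sampled_idx: torch.Tensor) -> torch.Tensor:
        # clips: (num_clips, C, T, H, W); sampled_idx: 1-D LongTensor of clips to re-encode now
        fresh = self.backbone(clips[sampled_idx])        # gradients flow only through these clips
        feats = self.memory.detach().clone()             # cached features for all other clips
        feats = feats.index_copy(0, sampled_idx, fresh)  # splice the fresh features in (keeps grad)
        with torch.no_grad():                            # refresh the cache for later iterations
            self.memory[sampled_idx] = fresh.detach()
        return feats                                     # full-length features for boundary localization
```

Because only the sampled clips pass through the backbone, per-iteration memory and compute no longer grow with the length of the untrimmed video, which is what allows the backbone to stay unfrozen at full spatial resolution.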
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Online Temporal Action Localization with Memory-Augmented Transformer [61.39427407758131]
We propose a memory-augmented transformer (MATR) for online temporal action localization.
MATR selectively preserves past segment features, allowing it to leverage long-term context for inference.
We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action.
arXiv Detail & Related papers (2024-08-06T04:55:33Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos.
We propose a robust association of the embeddings stored in the memory with query embeddings during the update process.
Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
arXiv Detail & Related papers (2023-05-22T08:31:16Z)
- Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network, or REMN, for semi-supervised video object segmentation (VOS).
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results, with a $\mathcal{J}\&\mathcal{F}$ score of 86.3% on DAVIS 2017 and a $\mathcal{G}$ overall mean of 85.5% on YouTube-VOS 2018.
arXiv Detail & Related papers (2023-04-24T06:19:21Z)
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [74.35009770905968]
We build a memory-augmented vision transformer with temporal support 30x longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
arXiv Detail & Related papers (2022-01-20T18:59:54Z)
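The MeMViT entry above describes extending temporal support by augmenting attention with memory. The snippet below is a generic, single-head illustration of that idea rather than the paper's actual architecture: keys and values from earlier clips are detached and concatenated with the current clip's, so the model attends over a longer history while gradients stay confined to the present clip. Names, shapes, and the cache size are assumptions.

```python
# Illustrative memory-augmented attention (not MeMViT's actual design): cached,
# detached keys/values from earlier clips extend the current clip's attention.
# Assumes clips of one video are processed sequentially with a fixed batch size.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAugmentedAttention(nn.Module):
    def __init__(self, dim: int, max_mem: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.max_mem = max_mem            # number of past clips kept in the cache
        self.mem_k, self.mem_v = [], []   # cached (detached) keys/values, one entry per past clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) tokens of the current clip
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k_all = torch.cat(self.mem_k + [k], dim=1)    # extend keys with the cached history
        v_all = torch.cat(self.mem_v + [v], dim=1)
        attn = F.softmax(q @ k_all.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = attn @ v_all
        # cache the current keys/values for future clips; detach so no gradient reaches the past
        self.mem_k = (self.mem_k + [k.detach()])[-self.max_mem:]
        self.mem_v = (self.mem_v + [v.detach()])[-self.max_mem:]
        return out
```

Detaching the cached keys and values is the central trade-off: the temporal receptive field grows with the history length, while training memory stays roughly constant because backpropagation never enters past clips.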
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.