TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization
- URL: http://arxiv.org/abs/2303.09055v1
- Date: Thu, 16 Mar 2023 03:11:26 GMT
- Title: TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization
- Authors: Tuan N. Tang, Kwonyoung Kim, Kwanghoon Sohn
- Abstract summary: We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling.
- Score: 52.234877003211814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Localization (TAL) is a challenging task in video
understanding that aims to identify and localize actions within a video
sequence. Recent studies have emphasized the importance of applying long-term
temporal context modeling (TCM) blocks to the extracted video clip features
such as employing complex self-attention mechanisms. In this paper, we present
the simplest method ever to address this task and argue that the extracted
video clip features are already informative to achieve outstanding performance
without sophisticated architectures. To this end, we introduce TemporalMaxer,
which minimizes long-term temporal context modeling while maximizing
information from the extracted video clip features with a basic,
parameter-free, and local region operating max-pooling block. Picking out only
the most critical information for adjacent and local clip embeddings, this
block results in a more efficient TAL model. We demonstrate that TemporalMaxer
outperforms other state-of-the-art methods that utilize long-term TCM such as
self-attention on various TAL datasets while requiring significantly fewer
parameters and computational resources. The code for our approach is publicly
available at https://github.com/TuanTNG/TemporalMaxer
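The core idea lends itself to a very short sketch. Below is a minimal PyTorch illustration of a parameter-free, local max-pooling block applied along the temporal axis of extracted clip features; the kernel size, stride, and any surrounding pyramid structure are illustrative assumptions, not the authors' exact configuration (see the linked repository for that).

```python
import torch
import torch.nn as nn

class MaxPoolBlock(nn.Module):
    """Parameter-free temporal block: local max pooling over adjacent clip embeddings.

    A minimal sketch of the idea described in the abstract; kernel size and
    stride are illustrative choices, not the official configuration.
    """

    def __init__(self, kernel_size: int = 3, stride: int = 2):
        super().__init__()
        # MaxPool1d operates on the temporal axis of (batch, channels, time) inputs.
        self.pool = nn.MaxPool1d(kernel_size, stride=stride, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feature_dim, num_clips) -- extracted video clip features.
        # Keep only the strongest response within each local temporal window;
        # no attention weights and no learnable parameters are involved.
        return self.pool(x)


if __name__ == "__main__":
    clip_features = torch.randn(2, 512, 128)   # 2 videos, 512-d features, 128 clips
    block = MaxPoolBlock()
    pooled = block(clip_features)
    print(pooled.shape)                        # torch.Size([2, 512, 64])
```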
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
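As a rough illustration of the frame-deduplication idea summarized in the LongVU entry above, the sketch below greedily drops frames whose features are highly similar to the last kept frame. It assumes precomputed per-frame embeddings (e.g., DINOv2 features) and an illustrative similarity threshold; neither the threshold nor the greedy scan is claimed to be LongVU's actual procedure.

```python
import torch
import torch.nn.functional as F

def drop_redundant_frames(frame_feats: torch.Tensor, threshold: float = 0.9) -> list[int]:
    """Greedy pruning of near-duplicate frames by feature similarity.

    frame_feats: (num_frames, dim) per-frame embeddings (e.g., precomputed
    DINOv2 features). The 0.9 threshold and the left-to-right greedy scan are
    illustrative assumptions. Returns indices of the frames to keep.
    """
    keep = [0]                           # always keep the first frame
    for i in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[i], frame_feats[keep[-1]], dim=0)
        if sim < threshold:              # sufficiently different -> keep it
            keep.append(i)
    return keep


if __name__ == "__main__":
    feats = torch.randn(16, 768)                     # 16 frames, 768-d features
    feats[5] = feats[4] + 0.01 * torch.randn(768)    # frame 5 is nearly identical to frame 4
    print(drop_redundant_frames(feats))              # frame 5 is likely dropped
```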
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- How Much Temporal Long-Term Context is Needed for Action Segmentation? [16.89998201009075]
We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
arXiv Detail & Related papers (2023-08-22T11:20:40Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
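The two-stream, two-temporal-resolution idea in the Coarse-Fine Networks entry above can be illustrated with a generic sketch: one stream operates on the full-resolution clip features, the other on a temporally downsampled copy, and the two are fused per timestep. All layer and pooling choices below are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamTemporal(nn.Module):
    """Generic two-stream sketch: a fine stream at full temporal resolution and a
    coarse stream on temporally downsampled features, fused by concatenation.
    All choices here are illustrative, not the Coarse-Fine Networks design.
    """

    def __init__(self, dim: int = 256, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.fine = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.coarse = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time)
        fine = self.fine(x)
        coarse = self.coarse(F.avg_pool1d(x, self.downsample))  # long-term, low-resolution view
        coarse = F.interpolate(coarse, size=x.size(-1), mode="linear", align_corners=False)
        return self.fuse(torch.cat([fine, coarse], dim=1))       # per-timestep fusion


if __name__ == "__main__":
    feats = torch.randn(2, 256, 64)
    print(TwoStreamTemporal()(feats).shape)   # torch.Size([2, 256, 64])
```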
- NUTA: Non-uniform Temporal Aggregation for Action Recognition [29.75987323741384]
We propose a method called the non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments.
Our model has achieved state-of-the-art performance on four widely used large-scale action-recognition datasets.
arXiv Detail & Related papers (2020-12-15T02:03:37Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on an interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.