TemporalMaxer: Maximize Temporal Context with only Max Pooling for
Temporal Action Localization
- URL: http://arxiv.org/abs/2303.09055v1
- Date: Thu, 16 Mar 2023 03:11:26 GMT
- Title: TemporalMaxer: Maximize Temporal Context with only Max Pooling for
Temporal Action Localization
- Authors: Tuan N. Tang, Kwonyoung Kim, Kwanghoon Sohn
- Abstract summary: We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling.
- Score: 52.234877003211814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Localization (TAL) is a challenging task in video
understanding that aims to identify and localize actions within a video
sequence. Recent studies have emphasized the importance of applying long-term
temporal context modeling (TCM) blocks to the extracted video clip features
such as employing complex self-attention mechanisms. In this paper, we present
the simplest method ever to address this task and argue that the extracted
video clip features are already informative to achieve outstanding performance
without sophisticated architectures. To this end, we introduce TemporalMaxer,
which minimizes long-term temporal context modeling while maximizing
information from the extracted video clip features with a basic,
parameter-free max-pooling block that operates on local regions. By picking out
only the most critical information from adjacent, local clip embeddings, this
block results in a more efficient TAL model. We demonstrate that TemporalMaxer
outperforms other state-of-the-art methods that utilize long-term TCM such as
self-attention on various TAL datasets while requiring significantly fewer
parameters and computational resources. The code for our approach is publicly
available at https://github.com/TuanTNG/TemporalMaxer
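The core operation is simple enough to sketch. The PyTorch snippet below is a minimal illustration, not the released implementation: the kernel size, stride, pyramid depth, and the MaxPoolBlock name are assumptions for exposition. It shows a parameter-free max-pooling block that operates on local temporal regions, keeping only the strongest response among adjacent clip embeddings while halving the temporal resolution at each level.

```python
# Minimal sketch of a parameter-free, local max-pooling block applied to
# clip-level features of shape (batch, channels, time). Kernel size, stride,
# and the pyramid depth are illustrative assumptions, not the official code.
import torch
import torch.nn as nn


class MaxPoolBlock(nn.Module):
    """Downsamples the temporal axis by keeping only the locally maximal
    responses of adjacent clip embeddings; it has no learnable parameters."""

    def __init__(self, kernel_size: int = 3, stride: int = 2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size, stride=stride,
                                 padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) clip features from a frozen video encoder
        return self.pool(x)


if __name__ == "__main__":
    feats = torch.randn(2, 512, 256)       # 256 clip embeddings, 512-dim each
    pyramid = [feats]
    block = MaxPoolBlock()
    for _ in range(4):                      # build a small temporal pyramid
        pyramid.append(block(pyramid[-1]))
    print([p.shape[-1] for p in pyramid])   # [256, 128, 64, 32, 16]
```

Because the block has no learnable weights, building the temporal pyramid this way adds no parameters, which is consistent with the abstract's claim of significantly lower parameter and compute cost than self-attention-based TCM blocks.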
Related papers
- Temporal Preference Optimization for Long-Form Video Understanding [28.623353303256653]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs.
TPO significantly enhances temporal understanding while reducing reliance on manually annotated data.
LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
- Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos.
This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z)
- Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.
Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues.
arXiv Detail & Related papers (2024-12-04T00:50:33Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well-established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- How Much Temporal Long-Term Context is Needed for Action Segmentation? [16.89998201009075]
We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
arXiv Detail & Related papers (2023-08-22T11:20:40Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)