Long Short-Term Transformer for Online Action Detection
- URL: http://arxiv.org/abs/2107.03377v1
- Date: Wed, 7 Jul 2021 17:49:51 GMT
- Title: Long Short-Term Transformer for Online Action Detection
- Authors: Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu,
Stefano Soatto
- Abstract summary: Long Short-term TRansformer (LSTR) is a new temporal modeling algorithm for online action detection.
Compared to prior work, LSTR provides an effective and efficient method for modeling long videos with less heuristic algorithm design.
- Score: 96.23884916995978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present Long Short-term TRansformer (LSTR), a new
temporal modeling algorithm for online action detection that employs a long-
and short-term memory mechanism to model prolonged sequence data. It consists
of an LSTR encoder that dynamically exploits coarse-scale historical
information from an extensively long time window (e.g., 2048 long-range frames
spanning up to 8 minutes), together with an LSTR decoder that focuses on a
short time window (e.g., 32 short-range frames spanning 8 seconds) to model the
fine-scale characterization of the ongoing event. Compared to prior work, LSTR
provides an effective and efficient way to model long videos with less
heuristic algorithm design. LSTR significantly improves over existing
state-of-the-art approaches on the standard online action detection benchmarks
THUMOS'14, TVSeries, and HACS Segment. Extensive empirical analysis validates
the configuration of the long- and short-term memories and the design choices
of LSTR.
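To make the long/short split concrete, here is a minimal PyTorch sketch of the idea, not the authors' released code: the long-range window is compressed into a small set of learnable latent tokens via cross-attention, and the short-range window then queries that compressed summary. All module names and hyperparameters (dim=512, num_latents=16, num_classes=22) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LongShortSketch(nn.Module):
    """Illustrative long/short-term memory split, not the official LSTR code.

    The long-range window (e.g., 2048 frame features) is compressed into a
    small set of learnable latent tokens via cross-attention; the short-range
    window (e.g., 32 recent frames) then queries that compressed summary
    through a standard transformer decoder to classify the ongoing action.
    """

    def __init__(self, dim=512, num_latents=16, num_classes=22):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learnable queries
        self.compress = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decode = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, long_mem, short_mem):
        # long_mem: (B, 2048, dim) coarse history; short_mem: (B, 32, dim) recent frames
        q = self.latents.unsqueeze(0).expand(long_mem.size(0), -1, -1)
        summary = self.compress(q, long_mem)   # (B, num_latents, dim) compressed memory
        out = self.decode(short_mem, summary)  # short window attends to the summary
        return self.head(out[:, -1])           # logits for the latest frame

model = LongShortSketch()
logits = model(torch.randn(2, 2048, 512), torch.randn(2, 32, 512))
print(logits.shape)  # torch.Size([2, 22])
```

The fixed number of latent tokens is what keeps the per-frame cost independent of how long the history grows.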
Related papers
- AverageLinear: Enhance Long-Term Time series forecasting with simple averaging [6.125620036017928]
Long-term time series analysis aims to forecast long-term trends by examining changes over past and future periods.
Models based on the Transformer architecture, through the application of attention mechanisms, have demonstrated notable performance advantages.
Our research reveals that the attention mechanism is not the core component responsible for performance enhancement.
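As a toy illustration of that claim, the sketch below replaces attention with parallel channel-wise linear heads whose outputs are averaged. The class name and the exact averaging scheme are assumptions, not the paper's AverageLinear architecture.

```python
import torch
import torch.nn as nn

class AveragedLinearForecaster(nn.Module):
    """Toy channel-wise linear forecaster that averages several parallel
    linear heads, illustrating the 'simple averaging beats attention' claim.
    Not the paper's exact AverageLinear architecture."""

    def __init__(self, lookback=96, horizon=24, num_heads=4):
        super().__init__()
        # each head maps the lookback window directly to the forecast horizon
        self.heads = nn.ModuleList(nn.Linear(lookback, horizon) for _ in range(num_heads))

    def forward(self, x):
        # x: (batch, channels, lookback); forecast each channel independently
        preds = torch.stack([h(x) for h in self.heads])  # (heads, B, C, horizon)
        return preds.mean(dim=0)                          # average the parallel heads

model = AveragedLinearForecaster()
y = model(torch.randn(8, 7, 96))
print(y.shape)  # torch.Size([8, 7, 24])
```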
arXiv Detail & Related papers (2024-12-30T05:56:25Z)
- Breaking the Context Bottleneck on Long Time Series Forecasting [6.36010639533526]
Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation.
Recent advancements have enhanced the efficiency of these models, but the challenge of effectively leveraging longer sequences persists.
We propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences.
arXiv Detail & Related papers (2024-12-21T10:29:34Z)
- Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models [58.6172667880028]
We propose a new method called forgetting curve to measure the memorization capability of long-context models.
We show that the forgetting curve has the advantage of being robust to the tested corpus and the experimental settings.
Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM based models.
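A minimal sketch of a forgetting-curve style measurement follows, assuming a hypothetical model.generate(prompt, max_new_tokens) interface; the paper's exact probing protocol may differ.

```python
import random

def forgetting_curve(model, tokens, probe_len=32, distances=(256, 1024, 4096, 16384)):
    """Sketch of a forgetting-curve style measurement (illustrative protocol):
    plant a probe span in the context, then check how well the model
    reproduces it after increasingly long gaps.

    `model.generate(prompt, max_new_tokens)` is a hypothetical interface;
    `tokens` is a list of token ids used as probe and filler material.
    """
    probe = tokens[:probe_len]
    curve = {}
    for d in distances:
        filler = [random.choice(tokens) for _ in range(d)]   # gap between probe and query
        prompt = probe + filler + probe[: probe_len // 2]    # cue with the probe's first half
        out = model.generate(prompt, max_new_tokens=probe_len - probe_len // 2)
        target = probe[probe_len // 2:]
        acc = sum(a == b for a, b in zip(out, target)) / len(target)
        curve[d] = acc  # recall accuracy as a function of distance
    return curve
```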
arXiv Detail & Related papers (2024-10-07T03:38:27Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Bidirectional Long-Range Parser for Sequential Data Understanding [3.76054468268713]
We introduce BLRP (Bidirectional Long-Range Parser), a novel and versatile attention mechanism designed to increase performance and efficiency on long-sequence tasks.
We show the benefits and versatility of our approach on vision and language domains by demonstrating competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2024-04-08T05:45:03Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
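One ingredient reported for this line of long-context continual pretraining is enlarging the base frequency of rotary position embeddings (RoPE) so attention can resolve longer distances. The sketch below only computes RoPE rotation angles; the base value of 500000 is an illustrative assumption (the standard RoPE base is 10000).

```python
import torch

def rope_angles(seq_len, head_dim, base=500000.0):
    """Rotary position embedding angles with an enlarged base frequency.

    The default base here is illustrative of 'adjusted base frequency'
    long-context recipes, not a value taken from the paper's abstract.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    return torch.outer(pos, inv_freq)  # (seq_len, head_dim // 2) rotation angles
```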
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims to identify the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- A Novel Long-term Iterative Mining Scheme for Video Salient Object Detection [54.53335983750033]
Short-term methodologies conflict with the actual mechanism of the human visual system.
This paper proposes a novel VSOD approach, which performs VSOD in a completely long-term manner.
The proposed approach outperforms almost all SOTA models on five widely used benchmark datasets.
arXiv Detail & Related papers (2022-06-20T04:27:47Z)
- Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
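A toy rendering of that two-branch decomposition follows, assuming motion can be approximated by differences between neighboring segment features and long-term aggregation by pooling; all names and shapes are illustrative, not the MENet implementation.

```python
import torch
import torch.nn as nn

class MENetSketch(nn.Module):
    """Toy rendering of the two-branch idea, not the paper's architecture:
    a motion branch built from differences between neighboring segment
    features, and video-level aggregation by pooling across all segments."""

    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        self.motion_proj = nn.Linear(dim, dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, feats):
        # feats: (B, segments, dim) per-segment appearance features
        motion = feats[:, 1:] - feats[:, :-1]          # short-term: neighboring differences
        motion = self.motion_proj(motion).mean(dim=1)  # enhance, then pool the motion branch
        appearance = feats.mean(dim=1)                 # long-term: aggregate all segments
        return self.head(torch.cat([appearance, motion], dim=-1))

model = MENetSketch()
print(model(torch.randn(4, 8, 256)).shape)  # torch.Size([4, 400])
```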
arXiv Detail & Related papers (2021-06-30T02:54:13Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.