Long Short-Term Transformer for Online Action Detection
- URL: http://arxiv.org/abs/2107.03377v1
- Date: Wed, 7 Jul 2021 17:49:51 GMT
- Title: Long Short-Term Transformer for Online Action Detection
- Authors: Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu,
Stefano Soatto
- Abstract summary: Long Short-term TRansformer (LSTR) is a new temporal modeling algorithm for online action detection.
Compared to prior work, LSTR provides an effective and efficient method to model long videos with less heuristic algorithm design.
- Score: 96.23884916995978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present Long Short-term TRansformer (LSTR), a new temporal
modeling algorithm for online action detection, by employing a long- and
short-term memory mechanism that is able to model prolonged sequence data. It
consists of an LSTR encoder that is capable of dynamically exploiting
coarse-scale historical information from an extensively long time window (e.g.,
2048 long-range frames of up to 8 minutes), together with an LSTR decoder that
focuses on a short time window (e.g., 32 short-range frames of 8 seconds) to
model the fine-scale characterization of the ongoing event. Compared to prior
work, LSTR provides an effective and efficient method to model long videos with
less heuristic algorithm design. LSTR achieves significantly improved results
on standard online action detection benchmarks, THUMOS'14, TVSeries, and HACS
Segment, over the existing state-of-the-art approaches. Extensive empirical
analysis validates the setup of the long- and short-term memories and the
design choices of LSTR.
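The two-window design described above lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' released implementation: the module layout, the use of learnable queries to compress the long-term memory, and all dimension defaults (feat_dim, d_model, n_queries, n_classes) are assumptions made for illustration; only the 2048-frame and 32-frame window sizes come from the abstract.
```python
import torch
import torch.nn as nn


class LSTRSketch(nn.Module):
    """Minimal long/short-term memory transformer for online action
    detection. Window sizes (2048 long-range frames, 32 short-range
    frames) follow the abstract; everything else is illustrative."""

    def __init__(self, feat_dim=1024, d_model=256, n_heads=8,
                 n_queries=16, n_classes=22):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # Hypothetical compression: learnable queries cross-attend over
        # the long-term memory to produce a fixed-size summary.
        self.latent_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Decoder over the short-term window: self-attention among the
        # recent frames, then cross-attention into the long-term summary.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, long_mem, short_mem):
        # long_mem:  (B, 2048, feat_dim) features from up to ~8 min of video
        # short_mem: (B, 32, feat_dim)   features from the last ~8 s
        long_tok = self.proj(long_mem)
        short_tok = self.proj(short_mem)
        queries = self.latent_queries.unsqueeze(0).expand(long_tok.size(0), -1, -1)
        summary, _ = self.enc_attn(queries, long_tok, long_tok)  # (B, n_queries, d_model)
        x, _ = self.self_attn(short_tok, short_tok, short_tok)
        x, _ = self.cross_attn(x, summary, summary)
        return self.classifier(x)  # per-frame logits, (B, 32, n_classes)


# Streaming usage: at each new frame, slide both windows forward and
# read off the logits of the newest (current) frame.
model = LSTRSketch()
long_mem = torch.randn(1, 2048, 1024)
short_mem = torch.randn(1, 32, 1024)
probs = model(long_mem, short_mem)[:, -1].softmax(dim=-1)  # current-frame scores
```
In an online setting, both memories would be updated in a streaming fashion as each new frame arrives, with only the newest frame's scores read out per step.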
Related papers
- Hierarchical Document Refinement for Long-context Retrieval-augmented Generation [28.421675216147374]
LongRefiner is an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents.
LongRefiner achieves competitive performance in various scenarios while incurring 10x lower computational cost and latency than the best baseline.
arXiv Detail & Related papers (2025-05-15T15:34:15Z)
- Balancing long- and short-term dynamics for the modeling of saliency in videos [14.527351636175615]
We present a Transformer-based approach to learn a joint representation of video frames and past saliency information.
Our model embeds long- and short-term information to detect dynamically shifting saliency in video.
arXiv Detail & Related papers (2025-04-08T11:09:37Z)
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video.
Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one.
We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
- Breaking the Context Bottleneck on Long Time Series Forecasting [6.36010639533526]
Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation.
Recent advancements have enhanced the efficiency of these models, but the challenge of effectively leveraging longer sequences persists.
We propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences.
arXiv Detail & Related papers (2024-12-21T10:29:34Z)
- LOGO -- Long cOntext aliGnment via efficient preference Optimization [29.510993993980573]
LOGO(Long cOntext aliGnment via efficient preference optimization) is a training strategy that first introduces preference optimization for long-context alignment.
By training with only 0.3B data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve comparable performance with GPT-4.
arXiv Detail & Related papers (2024-10-24T08:27:26Z)
- Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models [58.6172667880028]
We propose a new method called forgetting curve to measure the memorization capability of long-context models.
We show that the forgetting curve is robust to the tested corpus and the experimental settings.
Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM-based models.
arXiv Detail & Related papers (2024-10-07T03:38:27Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Bidirectional Long-Range Parser for Sequential Data Understanding [3.76054468268713]
We introduce BLRP (Bidirectional Long-Range Parser), a novel and versatile attention mechanism designed to increase performance and efficiency on long-sequence tasks.
We show the benefits and versatility of our approach on vision and language domains by demonstrating competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2024-04-08T05:45:03Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- A Novel Long-term Iterative Mining Scheme for Video Salient Object Detection [54.53335983750033]
The short-term methodology conflicts with the actual mechanism of the human visual system.
This paper proposes a novel VSOD approach that performs VSOD in a completely long-term way.
The proposed approach outperforms almost all SOTA models on five widely used benchmark datasets.
arXiv Detail & Related papers (2022-06-20T04:27:47Z)
- Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
arXiv Detail & Related papers (2021-06-30T02:54:13Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.