GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online
Action Prediction
- URL: http://arxiv.org/abs/2210.13605v2
- Date: Wed, 19 Apr 2023 00:41:39 GMT
- Title: GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online
Action Prediction
- Authors: Samrudhdhi B Rangrej, Kevin J Liang, Tal Hassner, James J Clark
- Abstract summary: Many online action prediction models observe complete frames to locate and attend to informative subregions in the frames called glimpses.
In this paper, we develop Glimpse Transformers (GliTr), which observe only narrow glimpses at all times.
- Score: 26.184988507662535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many online action prediction models observe complete frames to locate and
attend to informative subregions in the frames called glimpses and recognize an
ongoing action based on global and local information. However, in applications
with constrained resources, an agent may not be able to observe the complete
frame, yet must still locate useful glimpses to predict an incomplete action
based on local information only. In this paper, we develop Glimpse Transformers
(GliTr), which observe only narrow glimpses at all times, thus predicting an
ongoing action and the following most informative glimpse location based on the
partial spatiotemporal information collected so far. In the absence of a ground
truth for the optimal glimpse locations for action recognition, we train GliTr
using a novel spatiotemporal consistency objective: we require GliTr to attend
to glimpses whose features are similar to those of the corresponding complete
frames (i.e., spatial consistency) and to produce class logits at time $t$
equivalent to those predicted using whole frames up to $t$ (i.e., temporal
consistency). Including our proposed consistency objective yields ~10%
higher accuracy on the Something-Something-v2 (SSv2) dataset than the baseline
cross-entropy objective. Overall, despite observing only ~33% of the total area
per frame, GliTr achieves 53.02% and 93.91% accuracy on the SSv2 and Jester
datasets, respectively.
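The two consistency terms described in the abstract can be sketched as below. This is an illustrative NumPy reimplementation, not the authors' code: the function names, the mean-squared-error form of the spatial term, and the KL-divergence form of the temporal term (student glimpse logits matched against teacher full-frame logits at time $t$) are assumptions for the sake of the sketch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatial_consistency_loss(glimpse_feats, frame_feats):
    # Spatial consistency: pull glimpse features toward the features of the
    # corresponding complete frames (here, a simple mean-squared error).
    return np.mean((glimpse_feats - frame_feats) ** 2)

def temporal_consistency_loss(student_logits, teacher_logits):
    # Temporal consistency: match the class distribution predicted from
    # glimpses at time t to the one predicted from whole frames up to t,
    # via KL(teacher || student) averaged over the batch.
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return np.mean(np.sum(p * (log_p - log_q), axis=-1))
```

Both terms vanish when the glimpse-based predictions already agree with the full-frame teacher, so they act purely as distillation-style regularizers on top of the usual cross-entropy objective.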
Related papers
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting [19.12278036176021]
We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. Our method outperforms existing approaches under sparse observations by up to 34% in PSNR.
arXiv Detail & Related papers (2026-01-01T09:53:03Z) - S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting [67.93713728260646]
Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the nature underlying observations of the global weather system, limiting forecast performance. We propose a novel Structured Spatial Attention in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph. It aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention, considering both spatial proximity and global correlation.
arXiv Detail & Related papers (2025-09-10T05:33:28Z) - Local2Global query Alignment for Video Instance Segmentation [6.422775545814375]
Video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. This paper introduces Local2Global, an online framework for video instance segmentation that achieves state-of-the-art performance with a simple baseline and training performed purely in an online fashion. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate early alignment between local and global queries.
arXiv Detail & Related papers (2025-07-27T04:04:01Z) - Local-Global Information Interaction Debiasing for Dynamic Scene Graph
Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and avoid the model being unable to learn the tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z) - Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic
Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe of inter-frame attention block, namely Stand-alone Inter-temporal Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z) - Consistency driven Sequential Transformers Attention Model for Partially
Observable Scenes [3.652509571098291]
We develop a Sequential Transformers Attention Model (STAM) that only partially observes a complete image.
Our agent outperforms previous state-of-the-art by observing nearly 27% and 42% fewer pixels in glimpses on ImageNet and fMoW.
arXiv Detail & Related papers (2022-04-01T18:51:55Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy.
We propose a dual-modal framework for target localization, consisting of robust localization that suppresses distractors via ONR and accurate localization that attends to the target center precisely via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z) - Passenger Mobility Prediction via Representation Learning for Dynamic
Directed and Weighted Graph [31.062303389341317]
We propose a novel spatiotemporal graph attention network, namely Gallat (Graph prediction with all attention), as a solution.
In Gallat, by comprehensively incorporating those three intrinsic properties of DDW graphs, we build three attention layers to fully capture the dependencies among different regions across all historical time slots.
We evaluate the proposed model on real-world datasets, and our experimental results demonstrate that Gallat outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-01-04T03:32:01Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.