GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online
Action Prediction
- URL: http://arxiv.org/abs/2210.13605v2
- Date: Wed, 19 Apr 2023 00:41:39 GMT
- Title: GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online
Action Prediction
- Authors: Samrudhdhi B Rangrej, Kevin J Liang, Tal Hassner, James J Clark
- Abstract summary: Many online action prediction models observe complete frames to locate and attend to informative subregions in the frames called glimpses.
In this paper, we develop Glimpse Transformers (GliTr), which observe only narrow glimpses at all times.
- Score: 26.184988507662535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many online action prediction models observe complete frames to locate and
attend to informative subregions in the frames called glimpses and recognize an
ongoing action based on global and local information. However, in applications
with constrained resources, an agent may not be able to observe the complete
frame, yet must still locate useful glimpses to predict an incomplete action
based on local information only. In this paper, we develop Glimpse Transformers
(GliTr), which observe only narrow glimpses at all times, thus predicting an
ongoing action and the following most informative glimpse location based on the
partial spatiotemporal information collected so far. In the absence of a ground
truth for the optimal glimpse locations for action recognition, we train GliTr
using a novel spatiotemporal consistency objective: we require GliTr to attend
to glimpses whose features are similar to those of the corresponding complete
frames (i.e., spatial consistency) and to produce class logits at time $t$
equivalent to those predicted using whole frames up to $t$ (i.e., temporal
consistency). Including our proposed consistency objective yields ~10%
higher accuracy on the Something-Something-v2 (SSv2) dataset than the baseline
cross-entropy objective. Overall, despite observing only ~33% of the total area
per frame, GliTr achieves 53.02% and 93.91% accuracy on the SSv2 and Jester
datasets, respectively.
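The two consistency terms described in the abstract can be sketched as below. This is an illustrative NumPy reimplementation, not the authors' code: the function names, the mean-squared-error form of the spatial term, and the KL-divergence form of the temporal term (student glimpse logits matched against teacher full-frame logits at time $t$) are assumptions for the sake of the sketch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatial_consistency_loss(glimpse_feats, frame_feats):
    # Spatial consistency: pull glimpse features toward the features of the
    # corresponding complete frames (here, a simple mean-squared error).
    return np.mean((glimpse_feats - frame_feats) ** 2)

def temporal_consistency_loss(student_logits, teacher_logits):
    # Temporal consistency: match the class distribution predicted from
    # glimpses at time t to the one predicted from whole frames up to t,
    # via KL(teacher || student) averaged over the batch.
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return np.mean(np.sum(p * (log_p - log_q), axis=-1))
```

Both terms vanish when the glimpse-based predictions already agree with the full-frame teacher, so they act purely as distillation-style regularizers on top of the usual cross-entropy objective.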
Related papers
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting [19.12278036176021]
We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. Our method outperforms existing approaches under sparse observations by up to 34% in PSNR.
arXiv Detail & Related papers (2026-01-01T09:53:03Z) - S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting [67.93713728260646]
Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the nature underlying observations of the global weather system, limiting forecast performance. We propose a novel Structured Spatial Attention in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph. It aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention, considering both spatial proximity and global correlation.
arXiv Detail & Related papers (2025-09-10T05:33:28Z) - Local2Global query Alignment for Video Instance Segmentation [6.422775545814375]
Video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. This paper introduces Local2Global, an online framework for video instance segmentation that achieves state-of-the-art performance with a simple baseline and training performed purely in an online fashion. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate early alignment between local and global queries.
arXiv Detail & Related papers (2025-07-27T04:04:01Z) - Local-Global Information Interaction Debiasing for Dynamic Scene Graph
Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and avoid the model being unable to learn the tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z) - Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic
Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe of inter-frame attention block, namely Stand-alone Inter-temporal Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z) - Consistency driven Sequential Transformers Attention Model for Partially
Observable Scenes [3.652509571098291]
We develop a Sequential Transformers Attention Model (STAM) that only partially observes a complete image.
Our agent outperforms previous state-of-the-art by observing nearly 27% and 42% fewer pixels in glimpses on ImageNet and fMoW.
arXiv Detail & Related papers (2022-04-01T18:51:55Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy.
We propose a dual-modal framework for target localization, consisting of robust localization that suppresses distractors via ONR and accurate localization that attends to the target center precisely via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z) - Passenger Mobility Prediction via Representation Learning for Dynamic
Directed and Weighted Graph [31.062303389341317]
We propose a novel spatiotemporal graph attention network, namely Gallat (Graph prediction with all attention), as a solution.
In Gallat, by comprehensively incorporating those three intrinsic properties of DDW graphs, we build three attention layers to fully capture the dependencies among different regions across all historical time slots.
We evaluate the proposed model on real-world datasets, and our experimental results demonstrate that Gallat outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-01-04T03:32:01Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.