Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation
- URL: http://arxiv.org/abs/2206.10095v1
- Date: Tue, 21 Jun 2022 03:40:58 GMT
- Title: Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation
- Authors: Shuaicheng Li, Feng Zhang, Rui-Wei Zhao, Rui Feng, Kunlin Yang, Lingbo Liu, Jun Hou
- Abstract summary: Temporal action proposal generation can largely benefit from proper temporal and semantic context exploitation.
We present a novel Pyramid Region-based Slot Attention Network (PRSA-Net) to learn a unified visual representation with rich temporal and semantic context.
- Score: 17.01865793062819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been found that temporal action proposal generation, which
aims to discover temporal action instances, delimited by their start and end
frames, in untrimmed videos, can largely benefit from proper temporal and
semantic context exploitation. The latest efforts have been dedicated to
capturing temporal context and similarity-based semantic context through
self-attention modules. However, they still suffer from cluttered background
information and limited contextual feature learning. In this paper, we propose
a novel Pyramid Region-based Slot Attention (PRSlot) module to address these
issues. Instead of relying on similarity computation, the PRSlot module
directly learns local relations in an encoder-decoder manner and generates an
enhanced representation of a local region, based on attention over the input
features, called a slot. Specifically, given the input snippet-level features,
the PRSlot module takes the target snippet as the query and its surrounding
region as the key, then generates a slot representation for each query-key
pair by aggregating the local snippet context with a parallel pyramid
strategy. Based on PRSlot modules, we present a novel Pyramid Region-based
Slot Attention Network, termed PRSA-Net, to learn a unified visual
representation with rich temporal and semantic context for better proposal
generation. Extensive experiments are conducted on the two widely adopted
THUMOS14 and ActivityNet-1.3 benchmarks, where PRSA-Net outperforms other
state-of-the-art methods. In particular, on THUMOS14 we improve AR@100 from
the previous best 50.67% to 56.12% for proposal generation, and raise mAP at
0.5 tIoU from 51.9% to 58.7% for action detection. Code is available at
https://github.com/handhand123/PRSA-Net
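The query-key slot mechanism described in the abstract can be pictured with a short sketch. The code below is a minimal illustration, not the authors' implementation (see the linked repository for that): the window radii, the relation MLP, and the fusion layer are all assumptions. Each snippet acts as a query over its surrounding region, relation scores are produced by a learned network rather than dot-product similarity, and parallel branches with different radii form the pyramid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PRSlotSketch(nn.Module):
    """Minimal sketch of a PRSlot-style module; NOT the authors' code.

    Radii, the relation MLP, and the fusion layer are illustrative
    assumptions made for this sketch.
    """

    def __init__(self, dim: int, radii=(1, 2, 4)):
        super().__init__()
        self.radii = radii
        # One learned relation scorer per pyramid level (assumption).
        self.scorers = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in radii
        )
        self.fuse = nn.Linear(len(radii) * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) snippet-level features.
        slots = []
        for radius, scorer in zip(self.radii, self.scorers):
            window = 2 * radius + 1
            # Gather each snippet's surrounding region (the keys).
            padded = F.pad(x, (0, 0, radius, radius))          # (B, T+2r, D)
            keys = padded.unfold(1, window, 1)                 # (B, T, D, W)
            keys = keys.permute(0, 1, 3, 2)                    # (B, T, W, D)
            query = x.unsqueeze(2).expand_as(keys)             # (B, T, W, D)
            # Learned local relations instead of dot-product similarity.
            logits = scorer(torch.cat([query, keys], dim=-1))  # (B, T, W, 1)
            attn = logits.softmax(dim=2)                       # over the window
            slots.append((attn * keys).sum(dim=2))             # (B, T, D)
        # Fuse the parallel pyramid levels into one slot per snippet.
        return self.fuse(torch.cat(slots, dim=-1))             # (B, T, D)
```

A feature tensor of shape (batch, snippets, channels), e.g. torch.randn(2, 100, 256) through PRSlotSketch(256), keeps its shape, so a module like this could be stacked or placed in front of boundary and proposal heads.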
Related papers
- TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos [50.04992164981131]
Temporal localization in untrimmed videos is crucial for video understanding but remains challenging.
This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection.
We propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks.
arXiv Detail & Related papers (2025-03-09T09:11:26Z)
- Local Compressed Video Stream Learning for Generic Event Boundary Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before feeding into the network.
We propose a novel event boundary detection method that is fully end-to-end, leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Region-Enhanced Feature Learning for Scene Semantic Segmentation [19.20735517821943]
We propose using regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden.
We design a region-based feature enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction stage and a Region Dependency Modeling stage.
Our REFL-Net achieves a 1.8% mIoU gain on ScanNetV2 and a 1.7% mIoU gain on S3DIS with negligible computational cost.
arXiv Detail & Related papers (2023-04-15T06:35:06Z)
- Semantic Segmentation by Early Region Proxy [53.594035639400616]
We present a novel and efficient modeling approach that starts from interpreting the image as a tessellation of learnable regions.
To model region-wise context, we exploit Transformer to encode regions in a sequence-to-sequence manner.
Semantic segmentation is now carried out as per-region prediction on top of the encoded region embeddings.
arXiv Detail & Related papers (2022-03-26T10:48:32Z)
- Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach sets a new state of the art on major semantic segmentation benchmarks, including Cityscapes, ADE20K, Pascal Context, CamVid, and COCO-Stuff.
arXiv Detail & Related papers (2021-07-28T03:46:57Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism (see the sketch below).
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
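As a rough illustration of the biaffine pair scoring summarized in this entry, the sketch below scores every (start, end) index pair of a feature sequence in one pass. It is a hypothetical rendering in the spirit of standard biaffine scorers, not this paper's implementation; the projection size and the appended bias feature are assumptions.

```python
import torch
import torch.nn as nn


class BiaffinePairScorer(nn.Module):
    """Hypothetical sketch of biaffine start/end scoring (not the paper's code)."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.start_proj = nn.Linear(dim, hidden)
        self.end_proj = nn.Linear(dim, hidden)
        # An extra constant feature folds the linear and bias terms
        # into a single bilinear product.
        self.W = nn.Parameter(torch.randn(hidden + 1, hidden + 1) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) video features; returns (batch, T, T) pair scores.
        ones = x.new_ones(x.size(0), x.size(1), 1)
        s = torch.cat([self.start_proj(x).relu(), ones], dim=-1)  # (B, T, H+1)
        e = torch.cat([self.end_proj(x).relu(), ones], dim=-1)    # (B, T, H+1)
        # scores[b, i, j] = s_i^T W e_j for every candidate (start i, end j).
        return torch.einsum("bih,hk,bjk->bij", s, self.W, e)
```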
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet exploit the spatial and temporal long-range context interdependencies of such features, together with spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)