Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal
Transformers
- URL: http://arxiv.org/abs/2310.19447v1
- Date: Mon, 30 Oct 2023 11:17:22 GMT
- Title: Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal
Transformers
- Authors: Jinsong Zhang and Lingfeng Gu and Yu-Kun Lai and Xueyang Wang and Kun
Li
- Abstract summary: Group detection especially for large-scale scenes has many potential applications for public safety and smart cities.
Existing methods fail to cope with frequent occlusions in large-scale scenes with multiple people.
In this paper, we propose an end-to-end framework, GroupTransformer, for group detection in large-scale scenes.
- Score: 47.83631610648981
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Group detection, especially for large-scale scenes, has many potential
applications for public safety and smart cities. Existing methods fail to cope
with frequent occlusions in large-scale scenes with multiple people, and are
difficult to effectively utilize spatio-temporal information. In this paper, we
propose an end-to-end framework, GroupTransformer, for group detection in
large-scale scenes. To deal with the frequent occlusions caused by multiple
people, we design an occlusion encoder to detect and suppress severely occluded
person crops. To explore the potential spatio-temporal relationship, we propose
spatio-temporal transformers to simultaneously extract trajectory information
and fuse inter-person features in a hierarchical manner. Experimental results
on both large-scale and small-scale scenes demonstrate that our method achieves
better performance compared with state-of-the-art methods. On large-scale
scenes, our method significantly boosts the performance in terms of precision
and F1 score by more than 10%. On small-scale scenes, our method still improves
the performance of F1 score by more than 5%. The project page with code can be
found at http://cic.tju.edu.cn/faculty/likun/projects/GroupTrans.
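The occlusion encoder described above gates out severely occluded person crops before inter-person feature fusion. As a rough illustration of that gating idea only, here is a minimal NumPy sketch; the function name, the hard zero-out, and the 0.7 threshold are illustrative assumptions, not the paper's actual encoder, which is learned end-to-end.

```python
import numpy as np

def suppress_occluded_crops(features, occlusion_scores, threshold=0.7):
    # features: (N, D) per-person crop features
    # occlusion_scores: (N,) in [0, 1]; higher means more occluded
    keep = occlusion_scores < threshold        # crops considered usable
    gated = features * keep[:, None]           # suppress occluded crops
    return gated, keep

# Example: the second crop is heavily occluded and gets suppressed.
feats = np.ones((3, 4))
scores = np.array([0.1, 0.9, 0.5])
gated, keep = suppress_occluded_crops(feats, scores)
```

In the actual framework the suppression signal would feed the hierarchical spatio-temporal transformers rather than simply zeroing features.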
Related papers
- Regularized Contrastive Partial Multi-view Outlier Detection [76.77036536484114]
We propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD)
In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency.
Experimental results on four benchmark datasets demonstrate that our proposed approach could outperform state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-02T14:34:27Z) - Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study introduces a simple yet potent feature selection and aggregation strategy, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - Delving into CLIP latent space for Video Anomaly Recognition [24.37974279994544]
We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP.
Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace.
When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class.
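One plausible reading of this projection step can be sketched in a few lines of NumPy: project a frame feature onto a set of learned unit directions and take the magnitude of the coefficients. The function name and the orthonormal-direction assumption are illustrative, not AnomalyCLIP's actual formulation.

```python
import numpy as np

def direction_magnitude(feature, directions):
    # directions: (K, D) unit vectors for the learned directions
    # feature:    (D,) frame feature (e.g. from a CLIP image encoder)
    coeffs = directions @ feature              # projection coefficients
    return float(np.linalg.norm(coeffs))       # magnitude along the directions
```

Under this reading, frames whose features have a large magnitude along a class's directions would score as belonging to that anomalous class.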
arXiv Detail & Related papers (2023-10-04T14:01:55Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Towards Robust Semantic Segmentation of Accident Scenes via Multi-Source
Mixed Sampling and Meta-Learning [29.74171323437029]
We propose a Multi-source Meta-learning Unsupervised Domain Adaptation framework, to improve the generalization of segmentation transformers to extreme accident scenes.
Our approach achieves a mIoU score of 46.97% on the DADA-seg benchmark, surpassing the previous state-of-the-art model by more than 7.50%.
arXiv Detail & Related papers (2022-03-19T21:18:54Z) - Congested Crowd Instance Localization with Dilated Convolutional Swin
Transformer [119.72951028190586]
Crowd localization is a new computer vision task that evolved from crowd counting.
In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes.
We propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes.
arXiv Detail & Related papers (2021-08-02T01:27:53Z) - Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z) - Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z) - Two-branch Recurrent Network for Isolating Deepfakes in Videos [17.59209853264258]
We present a method for deepfake detection based on a two-branch network structure.
One branch propagates the original information, while the other branch suppresses the face content.
Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks.
arXiv Detail & Related papers (2020-08-08T01:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.