Multiple Object Tracking with Correlation Learning
- URL: http://arxiv.org/abs/2104.03541v1
- Date: Thu, 8 Apr 2021 06:48:02 GMT
- Title: Multiple Object Tracking with Correlation Learning
- Authors: Qiang Wang, Yun Zheng, Pan Pan, Yinghui Xu
- Abstract summary: We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment.
Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning.
Our approach demonstrates the effectiveness of correlation learning and obtains a state-of-the-art MOTA of 76.5% and an IDF1 of 73.6% on MOT17.
- Score: 16.959379957515974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that convolutional networks have substantially
improved the performance of multiple object tracking by simultaneously learning
detection and appearance features. However, due to the local perception of the
convolutional network structure itself, the long-range dependencies in both the
spatial and temporal domains cannot be obtained efficiently. To incorporate the spatial
layout, we propose to exploit the local correlation module to model the
topological relationship between targets and their surrounding environment,
which can enhance the discriminative power of our model in crowded scenes.
Specifically, we establish dense correspondences of each spatial location and
its context, and explicitly constrain the correlation volumes through
self-supervised learning. To exploit the temporal context, existing approaches
generally utilize two or more adjacent frames to construct an enhanced feature
representation, but the dynamic motion scene is inherently difficult to depict
via CNNs. Instead, our paper proposes a learnable correlation operator to
establish frame-to-frame matches over convolutional feature maps in the
different layers to align and propagate temporal context. In extensive
experiments on the MOT datasets, our approach demonstrates the effectiveness
of correlation learning and obtains a state-of-the-art MOTA of 76.5% and an
IDF1 of 73.6% on MOT17.
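As a rough illustration of what a local correlation volume is, the NumPy sketch below correlates every spatial location of a feature map with a small neighbourhood around it. It is not the authors' implementation: the function name `local_correlation`, the `radius` parameter, the zero padding, and the plain dot-product similarity are all assumptions made for this example. Passing the same map twice gives a spatial (within-frame) volume; passing feature maps from two adjacent frames gives a frame-to-frame volume.

```python
import numpy as np


def local_correlation(query, reference, radius):
    """Correlate each location of `query` with a local window of `reference`.

    query, reference: arrays of shape (C, H, W) holding C-dimensional features.
    radius: hypothetical window radius; the window is (2*radius + 1) squared.
    Returns an array of shape ((2*radius + 1)**2, H, W).
    """
    _, h, w = query.shape
    size = 2 * radius + 1
    # Zero-pad the reference so neighbours outside the map correlate to 0.
    padded = np.pad(reference, ((0, 0), (radius, radius), (radius, radius)))
    volume = np.empty((size * size, h, w), dtype=query.dtype)
    for dy in range(size):
        for dx in range(size):
            # `shifted[:, y, x]` is reference[:, y + dy - radius, x + dx - radius].
            shifted = padded[:, dy:dy + h, dx:dx + w]
            # Dot product over the channel axis for every location at once.
            volume[dy * size + dx] = (query * shifted).sum(axis=0)
    return volume


# Usage: an 8-channel 4x5 feature map correlated with its own 3x3 context.
features = np.random.default_rng(0).standard_normal((8, 4, 5))
volume = local_correlation(features, features, radius=1)
print(volume.shape)  # (9, 4, 5)
```

The centre channel (index 4 for `radius=1`) holds each location's correlation with itself; the other channels score its eight neighbours.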
Related papers
- Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition [0.0]
In this paper, we propose a self-attention GCN hybrid model, the Multi-Scale Spatial-Temporal self-attention (MSST)-GCN.
We use a spatial self-attention module with adaptive topology to model intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations between frames of a node.
arXiv Detail & Related papers (2024-04-03T10:25:45Z)
- Multi-Temporal Relationship Inference in Urban Areas [75.86026742632528]
Finding temporal relationships among locations can benefit many urban applications, such as dynamic offline advertising and smart public transport planning.
We propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet).
SEConv performs the intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing.
SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with additional evolving constraint to enhance the location representation learning and further handle the relationship sparsity.
arXiv Detail & Related papers (2023-06-15T07:48:32Z)
- Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks [50.2033914945157]
We present a representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data.
The framework consists of three stages: estimating pairwise intensity functions, learning a projection which minimises a notion of intensity reconstruction error.
Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses.
arXiv Detail & Related papers (2023-06-09T15:38:25Z)
- Dynamic Graph Convolutional Network with Attention Fusion for Traffic Flow Prediction [10.3426659705376]
We propose a novel dynamic graph convolution network with attention fusion to model synchronous spatial-temporal correlations.
We conduct extensive experiments on four real-world traffic datasets to demonstrate that our method surpasses state-of-the-art performance compared to 18 baseline methods.
arXiv Detail & Related papers (2023-02-24T12:21:30Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Learning Appearance-motion Normality for Video Anomaly Detection [11.658792932975652]
We propose a two-stream auto-encoder framework augmented with spatial-temporal memories.
It learns the appearance normality and motion normality independently and explores the correlations via adversarial learning.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on UCSD Ped2 and CUHK Avenue datasets.
arXiv Detail & Related papers (2022-07-27T08:30:19Z)
- DMGCRN: Dynamic Multi-Graph Convolution Recurrent Network for Traffic Forecasting [7.232141271583618]
We propose a novel dynamic multi-graph convolution recurrent network (DMGCRN) to tackle the above issues.
We use the distance-based graph to capture spatial information from nodes that are close in distance.
We also construct a novel latent graph, which encodes the structural correlations among roads, to capture spatial information from nodes that are similar in structure.
arXiv Detail & Related papers (2021-12-04T06:51:55Z)
- Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)