Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos
- URL: http://arxiv.org/abs/2004.00451v2
- Date: Fri, 6 Nov 2020 12:17:33 GMT
- Title: Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos
- Authors: Daniel Cores, Víctor M. Brea and Manuel Mucientes
- Abstract summary: This paper addresses the problem of how to exploit spatio-temporal information available in videos to improve object detection precision.
We propose a two-stage object detector called FANet based on short-term spatio-temporal feature aggregation.
- Score: 2.4923006485141284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of how to exploit spatio-temporal
information available in videos to improve object detection precision. We
propose a two-stage object detector called FANet based on short-term
spatio-temporal feature aggregation to give a first detection set, and
long-term object linking to refine these detections. Firstly, we generate a set
of short tubelet proposals containing the object in $N$ consecutive frames.
Then, we aggregate RoI-pooled deep features through the tubelet using a
temporal pooling operator that summarizes the information with a fixed size
output independent of the number of input frames. On top of that, we define a
double-head implementation that we feed with aggregated spatio-temporal
information for spatio-temporal object classification, and with spatial
information extracted from the current frame for object localization and
spatial classification. Furthermore, we specialize each head branch
architecture to perform better in each task, taking into account its input data.
Finally, a long-term linking method builds long tubes using the previously
calculated short tubelets to overcome detection errors. We have evaluated our
model on the widely used ImageNet VID dataset, achieving an 80.9% mAP, which is
the new state-of-the-art result for single models. Also, on the challenging
small object detection dataset USC-GRAD-STDdb, our proposal outperforms the
single frame baseline by 5.4% mAP.
Related papers
- STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT).
We use historical embedding features to model the representation of ReID and detection features in a sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z) - PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on a large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z) - Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z) - Object-Centric Multiple Object Tracking [124.30650395969126]
This paper proposes a video object-centric model for multiple-object tracking pipelines.
It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module.
Benefiting from object-centric learning, we only require sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection [12.650574326251023]
We present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information.
Our method significantly improves single-frame Sparse R-CNN by 8%-9% in mAP.
arXiv Detail & Related papers (2022-10-05T16:17:55Z) - ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer [2.4366811507669124]
We propose a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames.
We employ full attention mechanisms to take advantage of feature correlations over both the spatial and temporal dimensions.
Results show a significant 5% mAP improvement on the KITTI MOD dataset.
arXiv Detail & Related papers (2021-07-13T07:38:08Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance over state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.