Identity-Consistent Aggregation for Video Object Detection
- URL: http://arxiv.org/abs/2308.07737v1
- Date: Tue, 15 Aug 2023 12:30:22 GMT
- Title: Identity-Consistent Aggregation for Video Object Detection
- Authors: Chaorui Deng, Da Chen, Qi Wu
- Abstract summary: In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame.
We propose ClipVID, a VID model equipped with Identity-Consistent Aggregation layers specifically designed for mining fine-grained and identity-consistent temporal contexts.
Experiments demonstrate the superiority of our method: state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running about 7x faster (39.3 fps) than previous SOTAs.
- Score: 21.295859014601334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities, yet intuitively, aggregating local views of the same object across different frames should facilitate a better understanding of that object. In this paper, we therefore aim to enable the model to focus on the identity-consistent temporal contexts of each object, so as to obtain more comprehensive object representations and handle rapid object appearance variations such as occlusion and motion blur. However, realizing this goal on top of existing VID models is inefficient because of their redundant region proposals and non-parallel, frame-wise prediction scheme. To address this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained, identity-consistent temporal contexts. ClipVID effectively reduces these redundancies through a set-prediction strategy, which makes the ICA layers highly efficient and further allows us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running about 7x faster (39.3 fps) than previous SOTAs.
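As an illustration of the core idea, the sketch below gives a minimal, hypothetical PyTorch rendering of an ICA layer. It is not the authors' implementation: it assumes a set-prediction decoder has already produced identity-aligned object slots for every frame (slot n holds the same object identity in each frame), and it restricts temporal attention to each identity's own track. The class name, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of an Identity-Consistent Aggregation (ICA) layer.
# Hypothetical interpretation of the paper's idea, not the authors' code:
# we assume object slots are already aligned by identity across frames.
import torch
import torch.nn as nn

class IdentityConsistentAggregation(nn.Module):
    """Temporal attention restricted to identity-consistent context.

    Each object slot attends only to the embeddings of the *same*
    identity in the other frames of the clip, never to other objects.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_embed: torch.Tensor) -> torch.Tensor:
        # obj_embed: (T, N, D) -- T frames, N identity-aligned slots, D channels.
        # Regroup so each identity forms its own temporal sequence: (N, T, D).
        per_identity = obj_embed.permute(1, 0, 2)
        # Attention runs within a single identity's track, so the aggregated
        # context never mixes features from different objects.
        ctx, _ = self.attn(per_identity, per_identity, per_identity)
        out = self.norm(per_identity + ctx)
        # Restore (T, N, D) for downstream clip-wise prediction heads.
        return out.permute(1, 0, 2)

# Usage: enhance a 16-frame clip with 100 object slots of width 256.
layer = IdentityConsistentAggregation(dim=256)
clip_embeddings = torch.randn(16, 100, 256)
enhanced = layer(clip_embeddings)
print(enhanced.shape)  # torch.Size([16, 100, 256])
```

Confining attention to each identity's track rather than letting every slot attend over all proposals in all frames cuts the attention cost by roughly a factor of the slot count, which is consistent with the paper's claim that the set-prediction strategy keeps the ICA layers efficient enough for parallel clip-wise prediction.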
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z) - Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method sets a new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z) - Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - DFA: Dynamic Feature Aggregation for Efficient Video Object Detection [15.897168900583774]
We propose a vanilla dynamic aggregation module that adaptively selects the frames for feature enhancement.
We extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version.
On the ImageNet VID benchmark, FGFA and SELSA integrated with our proposed methods improve inference speed by 31% and 76%, respectively.
arXiv Detail & Related papers (2022-10-02T17:54:15Z) - Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
Video object segmentation (VOS) aims to densely track certain objects in videos.
One of the main challenges in this task is the existence of background distractors that appear similar to the target objects.
We propose three novel strategies to suppress such distractors.
Our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time.
arXiv Detail & Related papers (2022-07-14T14:25:19Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)