Explore Spatio-temporal Aggregation for Insubstantial Object Detection:
Benchmark Dataset and Baseline
- URL: http://arxiv.org/abs/2206.11459v2
- Date: Fri, 4 Aug 2023 08:43:08 GMT
- Title: Explore Spatio-temporal Aggregation for Insubstantial Object Detection:
Benchmark Dataset and Baseline
- Authors: Kailai Zhou, Yibo Wang, Tao Lv, Yunqian Li, Linsen Chen, Qiu Shen, Xun
Cao
- Abstract summary: We address a rarely explored task named Insubstantial Object Detection (IOD), which aims to localize objects that have amorphous shapes with indistinct boundaries, resemble their surroundings, and lack color.
We construct an IOD-Video dataset comprising 600 videos (141,017 frames) covering various distances, sizes, visibilities, and scenes, captured in different spectral ranges.
In addition, we develop a spatio-temporal aggregation framework for IOD, in which different backbones are deployed and a spatio-temporal aggregation loss (STAloss) is elaborately designed to leverage the consistency along the time axis.
- Score: 16.59161777626215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We take on a rarely explored task named Insubstantial Object Detection
(IOD), which aims to localize objects with the following characteristics: (1)
amorphous shape with indistinct boundary; (2) similarity to surroundings; (3)
absence in color. Accordingly, it is far more challenging to distinguish
insubstantial objects in a single static frame and the collaborative
representation of spatial and temporal information is crucial. Thus, we
construct an IOD-Video dataset comprised of 600 videos (141,017 frames)
covering various distances, sizes, visibilities, and scenes, captured in different
spectral ranges. In addition, we develop a spatio-temporal aggregation
framework for IOD, in which different backbones are deployed and a
spatio-temporal aggregation loss (STAloss) is elaborately designed to leverage
the consistency along the time axis. Experiments conducted on the IOD-Video dataset
demonstrate that spatio-temporal aggregation can significantly improve the
performance of IOD. We hope our work will attract further research into this
valuable yet challenging task. The code will be available at:
\url{https://github.com/CalayZhou/IOD-Video}.
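The abstract does not spell out the exact form of STAloss, so the following is only a minimal, hypothetical sketch of what a temporal-consistency term over per-frame box predictions could look like; the function name, box format, Smooth-L1 choice, and weighting are assumptions, not the authors' released implementation at the repository above.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(boxes: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    """Hypothetical temporal-consistency term (not the official STAloss).

    boxes: tensor of shape (T, 4) holding (cx, cy, w, h) predictions for the
    same insubstantial object over T consecutive frames. The term penalizes
    abrupt changes between adjacent frames, encouraging predictions that stay
    consistent along the time axis.
    """
    diff = boxes[1:] - boxes[:-1]  # frame-to-frame change, shape (T-1, 4)
    # Smooth-L1 keeps the penalty robust to occasional large jumps
    return weight * F.smooth_l1_loss(diff, torch.zeros_like(diff))

# Example: predictions for 5 consecutive frames of one object
preds = torch.tensor([[50.0, 60.0, 20.0, 30.0],
                      [51.0, 61.0, 21.0, 30.0],
                      [52.5, 62.0, 21.5, 31.0],
                      [53.0, 63.5, 22.0, 31.0],
                      [54.0, 64.0, 22.5, 32.0]])
print(temporal_consistency_loss(preds))  # small value for a smooth trajectory
```

In practice such a term would be added to the usual per-frame detection losses (classification and box regression); the actual STAloss in the paper may differ in form and weighting.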
Related papers
- STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT).
We use historical embedding features to model ReID and detection feature representations in sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z)
- Spatio-Temporal-based Context Fusion for Video Anomaly Detection [1.7710335706046505]
Video anomaly detection aims to discover abnormal events in videos, where the principal objects are targets such as people and vehicles.
Most existing methods only focus on the temporal context, ignoring the role of the spatial context in anomaly detection.
This paper proposes a video anomaly detection algorithm based on fusing the spatial context of targets with the temporal context.
arXiv Detail & Related papers (2022-10-18T04:07:10Z)
- Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
Video object segmentation (VOS) aims to densely track certain objects in videos.
One of the main challenges in this task is the existence of background distractors that appear similar to the target objects.
We propose three novel strategies to suppress such distractors.
Our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time.
arXiv Detail & Related papers (2022-07-14T14:25:19Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)