Joint Representation of Temporal Image Sequences and Object Motion for
Video Object Detection
- URL: http://arxiv.org/abs/2011.10278v1
- Date: Fri, 20 Nov 2020 08:46:12 GMT
- Title: Joint Representation of Temporal Image Sequences and Object Motion for
Video Object Detection
- Authors: Junho Koh, Jaekyum Kim, Younji Shin, Byeongwon Lee, Seungji Yang and
Jun Won Choi
- Abstract summary: We propose a new video object detector (VoD) method referred to as temporal feature aggregation and motion-aware VoD (TM-VoD)
TM-VoD aggregates visual feature maps extracted by convolutional neural networks, applying temporal attention gating and spatial feature alignment.
The proposed method outperforms existing VoD methods and achieves a performance comparable to that of state-of-the-art VoDs.
- Score: 9.699309217726691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new video object detector (VoD) method referred
to as temporal feature aggregation and motion-aware VoD (TM-VoD), which
produces a joint representation of temporal image sequences and object motion.
The proposed TM-VoD aggregates visual feature maps extracted by convolutional
neural networks, applying temporal attention gating and spatial feature
alignment. This temporal feature aggregation is performed in two stages in a
hierarchical fashion. In the first stage, the visual feature maps are fused at
the pixel level via a gated attention model. In the second stage, the proposed
method aggregates the features after aligning the object features using
temporal box offset calibration, and weights them according to the cosine
similarity measure. The proposed TM-VoD also finds the representation of the
motion of objects in two successive steps. The pixel-level motion features are
first computed based on the incremental changes between the adjacent visual
feature maps. Then, box-level motion features are obtained from both the region
of interest (RoI)-aligned pixel-level motion features and the sequential
changes of the box coordinates. Finally, all these features are concatenated to
produce a joint representation of the objects for VoD. The experiments
conducted on the ImageNet VID dataset demonstrate that the proposed method
outperforms existing VoD methods and achieves a performance comparable to that
of state-of-the-art VoDs.
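The three core operations the abstract describes — pixel-level gated attention fusion, cosine-similarity weighting of aligned box features, and motion features from incremental changes — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the gating projection `w_gate` stands in for a learned convolutional gate, and the RoI alignment and box offset calibration steps are assumed to have already produced the per-frame box features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_temporal_fusion(feats, w_gate):
    """Pixel-level fusion of per-frame feature maps with a gated
    attention model (sketch). feats: (T, C, H, W); w_gate: (C,) is a
    stand-in for a learned gating projection."""
    # Per-pixel scalar gate for each frame, computed from its own features.
    logits = np.einsum('tchw,c->thw', feats, w_gate)   # (T, H, W)
    gates = sigmoid(logits)[:, None, :, :]             # (T, 1, H, W)
    # Gate-weighted average over the temporal axis.
    fused = (gates * feats).sum(axis=0) / (gates.sum(axis=0) + 1e-8)
    return fused                                       # (C, H, W)

def cosine_box_weights(ref_feat, box_feats):
    """Weight temporally aligned box features by cosine similarity
    to the reference frame's box feature (softmax-normalized)."""
    def unit(v):
        return v / (np.linalg.norm(v) + 1e-8)
    sims = np.array([unit(ref_feat) @ unit(f) for f in box_feats])
    return np.exp(sims) / np.exp(sims).sum()           # sums to 1

def pixelwise_motion(feats):
    """Pixel-level motion features as incremental changes between
    adjacent visual feature maps: M_t = F_{t+1} - F_t."""
    return feats[1:] - feats[:-1]                      # (T-1, C, H, W)

def box_coordinate_motion(boxes):
    """Sequential changes of box coordinates (x1, y1, x2, y2) across
    frames, one ingredient of the box-level motion features."""
    return boxes[1:] - boxes[:-1]                      # (T-1, 4)
```

In the paper's pipeline, the outputs of these stages (fused appearance features, RoI-aligned motion features, and box coordinate changes) are concatenated into the joint object representation; the sketch above only illustrates the individual operations.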
Related papers
- A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four proposed datasets.
arXiv Detail & Related papers (2024-05-13T21:47:35Z) - TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes [58.180556221044235]
We present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception.
Our formulation is designed for dynamic scenes, consisting of moving objects or human actions.
We evaluate its performance on challenging datasets, including Okutama Action and UG2.
arXiv Detail & Related papers (2024-05-04T21:55:33Z) - Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
In this paper, motivated by optical flow, the bilateral motion-oriented features are proposed.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
arXiv Detail & Related papers (2022-09-26T01:36:22Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer [2.4366811507669124]
We propose a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames.
We employ the full attention mechanisms to take advantage of the features correlations over both dimensions.
Results show a significant 5% mAP improvement on the KITTI MOD dataset.
arXiv Detail & Related papers (2021-07-13T07:38:08Z) - LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on the single-frame detection, while ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.