Joint Representation of Temporal Image Sequences and Object Motion for
Video Object Detection
- URL: http://arxiv.org/abs/2011.10278v1
- Date: Fri, 20 Nov 2020 08:46:12 GMT
- Title: Joint Representation of Temporal Image Sequences and Object Motion for
Video Object Detection
- Authors: Junho Koh, Jaekyum Kim, Younji Shin, Byeongwon Lee, Seungji Yang and
Jun Won Choi
- Abstract summary: We propose a new video object detector (VoD) method referred to as temporal feature aggregation and motion-aware VoD (TM-VoD).
TM-VoD aggregates visual feature maps extracted by convolutional neural networks, applying temporal attention gating and spatial feature alignment.
The proposed method outperforms existing VoD methods and achieves a performance comparable to that of state-of-the-art VoDs.
- Score: 9.699309217726691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new video object detector (VoD) method referred
to as temporal feature aggregation and motion-aware VoD (TM-VoD), which
produces a joint representation of temporal image sequences and object motion.
The proposed TM-VoD aggregates visual feature maps extracted by convolutional
neural networks by applying temporal attention gating and spatial feature
alignment. This temporal feature aggregation is performed in two stages in a
hierarchical fashion. In the first stage, the visual feature maps are fused at
the pixel level via a gated attention model. In the second stage, the object
features are aligned using temporal box offset calibration and then aggregated
with weights given by the cosine similarity measure. The proposed TM-VoD also
derives a representation of the
motion of objects in two successive steps. The pixel-level motion features are
first computed based on the incremental changes between the adjacent visual
feature maps. Then, box-level motion features are obtained from both the region
of interest (RoI)-aligned pixel-level motion features and the sequential
changes of the box coordinates. Finally, all these features are concatenated to
produce a joint representation of the objects for VoD. The experiments
conducted on the ImageNet VID dataset demonstrate that the proposed method
outperforms existing VoD methods and achieves a performance comparable to that
of state-of-the-art VoDs.
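As a rough illustration of the pipeline described in the abstract, the minimal PyTorch sketch below mirrors the two aggregation stages and the motion features: pixel-level gated attention fusion, cosine-similarity-weighted aggregation of aligned box features, and motion cues from adjacent feature-map differences and box-coordinate changes. The module names, tensor shapes, softmax-based gating, and the final concatenation are illustrative assumptions only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelLevelFusion(nn.Module):
    """Stage 1 sketch: fuse per-frame feature maps with a gated attention model."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution predicts a per-pixel gate for every frame (assumed form).
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) visual feature maps of T consecutive frames.
        gates = self.gate(feats)                      # (T, 1, H, W)
        weights = torch.softmax(gates, dim=0)         # normalize gates over time
        return (weights * feats).sum(dim=0)           # (C, H, W) fused map


def aggregate_box_features(ref_roi: torch.Tensor,
                           support_rois: torch.Tensor) -> torch.Tensor:
    """Stage 2 sketch: weight already-aligned support RoI features by their
    cosine similarity to the reference RoI feature and sum them."""
    # ref_roi: (D,), support_rois: (T, D) pooled object features.
    sims = F.cosine_similarity(support_rois, ref_roi.unsqueeze(0), dim=1)  # (T,)
    weights = torch.softmax(sims, dim=0)
    return (weights.unsqueeze(1) * support_rois).sum(dim=0)                # (D,)


def motion_features(feats: torch.Tensor, boxes: torch.Tensor):
    """Pixel-level motion from differences of adjacent feature maps, plus a
    box-level cue from the sequential changes of the box coordinates."""
    pixel_motion = feats[1:] - feats[:-1]             # (T-1, C, H, W)
    box_motion = boxes[1:] - boxes[:-1]               # (T-1, 4)
    return pixel_motion, box_motion


if __name__ == "__main__":
    T, C, H, W, D = 4, 16, 32, 32, 64
    feats = torch.randn(T, C, H, W)
    boxes = torch.rand(T, 4)

    fused_map = PixelLevelFusion(C)(feats)
    obj_feat = aggregate_box_features(torch.randn(D), torch.randn(T, D))
    pixel_motion, box_motion = motion_features(feats, boxes)

    # Joint representation: appearance and motion features concatenated.
    # Here the flattened box motion stands in for the RoI-aligned box-level
    # motion features described in the abstract.
    joint = torch.cat([obj_feat, box_motion.flatten()], dim=0)
    print(fused_map.shape, joint.shape)               # (16, 32, 32) and (76,)
```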
Related papers
- Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - In-2-4D: Inbetweening from Two Single-View Images to 4D Generation [54.62824686338408]
We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting.
Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D.
arXiv Detail & Related papers (2025-04-11T09:01:09Z) - POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction [53.19968902152528]
We present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion.
Specifically, our method learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps.
We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks.
arXiv Detail & Related papers (2025-04-08T05:33:13Z) - JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Video action detection (VAD) is a formidable task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS).
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z) - A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms state-of-the-art solutions on four datasets.
arXiv Detail & Related papers (2024-05-13T21:47:35Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer [2.4366811507669124]
We propose a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames.
We employ full attention mechanisms to take advantage of feature correlations over both dimensions.
Results show a significant 5% mAP improvement on the KITTI MOD dataset.
arXiv Detail & Related papers (2021-07-13T07:38:08Z) - LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, while ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.