TF-Blender: Temporal Feature Blender for Video Object Detection
- URL: http://arxiv.org/abs/2108.05821v1
- Date: Thu, 12 Aug 2021 16:01:34 GMT
- Title: TF-Blender: Temporal Feature Blender for Video Object Detection
- Authors: Yiming Cui, Liqi Yan, Zhiwen Cao, Dongfang Liu
- Abstract summary: Video object detection is a challenging task because isolated video frames may encounter appearance deterioration.
We propose TF-Blender, which includes three modules: 1) Temporal relation models the relations between the current frame and its neighboring frames to preserve spatial information.
Thanks to its simplicity, TF-Blender can be effortlessly plugged into any detection network to improve detection performance.
- Score: 6.369234802164117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object detection is a challenging task because isolated video frames
may encounter appearance deterioration, which introduces great confusion for
detection. One of the popular solutions is to exploit the temporal information
and enhance per-frame representation through aggregating features from
neighboring frames. Despite achieving improvements in detection, existing
methods focus on the selection of higher-level video frames for aggregation
rather than modeling lower-level temporal relations to increase the feature
representation. To address this limitation, we propose a novel solution named
TF-Blender, which includes three modules: 1) Temporal relation models the
relations between the current frame and its neighboring frames to preserve
spatial information; 2) Feature adjustment enriches the representation of
every neighboring feature map; 3) Feature blender combines outputs from the
first two modules and produces stronger features for the later detection tasks.
Thanks to its simplicity, TF-Blender can be effortlessly plugged into any detection
network to improve detection performance. Extensive evaluations on the ImageNet VID
and YouTube-VIS benchmarks demonstrate the performance gains of applying
TF-Blender to recent state-of-the-art methods.
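The abstract describes TF-Blender only at the module level, so the following is a minimal, hedged PyTorch sketch of the general idea: compute temporal relation weights between the current frame and its neighbors, adjust the neighboring feature maps, and blend the aggregated result back into the current frame's feature map. The class name, the cosine-similarity relation weights, and the 1x1/3x3 convolutions are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of the temporal-feature-blending idea from the abstract.
# The exact computations inside each module are assumptions, not TF-Blender's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalFeatureBlenderSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # "Feature adjustment": enrich every neighboring feature map (assumed 1x1 conv).
        self.adjust = nn.Conv2d(channels, channels, kernel_size=1)
        # "Feature blender": fuse current and aggregated neighbor features (assumed 3x3 conv).
        self.blend = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, current: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # current:   (B, C, H, W)     feature map of the frame to be detected
        # neighbors: (B, T, C, H, W)  feature maps of T neighboring frames
        B, T, C, H, W = neighbors.shape

        # "Temporal relation": per-location cosine similarity between the current
        # frame and each neighbor, used as aggregation weights (assumption).
        cur = F.normalize(current, dim=1).unsqueeze(1)    # (B, 1, C, H, W)
        nbr = F.normalize(neighbors, dim=2)               # (B, T, C, H, W)
        relation = (cur * nbr).sum(dim=2, keepdim=True)   # (B, T, 1, H, W)
        weights = torch.softmax(relation, dim=1)          # weights over the T frames

        # "Feature adjustment" applied to every neighboring feature map.
        adjusted = self.adjust(neighbors.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)

        # Weighted aggregation of the adjusted neighbor features.
        aggregated = (weights * adjusted).sum(dim=1)      # (B, C, H, W)

        # "Feature blender": combine current and aggregated features for detection.
        return self.blend(torch.cat([current, aggregated], dim=1))


if __name__ == "__main__":
    blender = TemporalFeatureBlenderSketch(channels=256)
    cur = torch.randn(2, 256, 32, 32)
    nbrs = torch.randn(2, 4, 256, 32, 32)
    print(blender(cur, nbrs).shape)  # torch.Size([2, 256, 32, 32])
```

In this style of module, the blended feature map simply replaces the per-frame feature that an image detector would otherwise consume, which is what makes it plug-and-play with existing detection networks, as the abstract claims.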
Related papers
- STF: Spatio-Temporal Fusion Module for Improving Video Object Detection [7.213855322671065]
Consecutive frames in a video contain redundancy, but they may also contain complementary information for the detection task.
We propose a spatio-temporal fusion framework (STF) to leverage this complementary information.
The proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors.
arXiv Detail & Related papers (2024-02-16T15:19:39Z) - SODFormer: Streaming Object Detection with Transformer Using Events and
Frames [31.293847706713052]
A DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has increasingly been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector that first integrates events and frames to continuously detect objects in an asynchronous manner.
arXiv Detail & Related papers (2023-08-08T04:53:52Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Graph Neural Network and Spatiotemporal Transformer Attention for 3D
Video Object Detection from Point Clouds [94.21415132135951]
We propose to detect 3D objects by exploiting temporal information in multiple frames.
We implement our algorithm based on prevalent anchor-based and anchor-free detectors.
arXiv Detail & Related papers (2022-07-26T05:16:28Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.