Temporal RoI Align for Video Object Recognition
- URL: http://arxiv.org/abs/2109.03495v1
- Date: Wed, 8 Sep 2021 08:35:21 GMT
- Title: Temporal RoI Align for Video Object Recognition
- Authors: Tao Gong, Kai Chen, Xinjiang Wang, Qi Chu, Feng Zhu, Dahua Lin,
Nenghai Yu, Huamin Feng
- Abstract summary: The proposed Temporal RoI Align operator can extract temporal information from the entire video for proposals.
We integrate it into single-frame video detectors and other state-of-the-art video detectors; quantitative experiments demonstrate that it consistently and significantly boosts performance.
- Score: 107.07049115214924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object detection is challenging in the presence of appearance
deterioration in certain video frames. Therefore, it is a natural choice to
aggregate temporal information from other frames of the same video into the
current frame. However, RoI Align, one of the core procedures of video
detectors, still extracts features for proposals from a single-frame feature
map, so the extracted RoI features lack temporal information from the video.
In this work, considering that the features of the same object instance are
highly similar across the frames of a video, a novel Temporal RoI Align
operator is proposed that exploits feature similarity to extract features from
other frames' feature maps for the current frame's proposals. The proposed
Temporal RoI Align operator can thereby extract temporal information from the
entire video for proposals. We integrate it into single-frame video detectors
and other state-of-the-art video detectors, and quantitative experiments
demonstrate that the proposed Temporal RoI Align operator consistently and
significantly boosts performance. Besides, the proposed Temporal RoI Align
operator can also be applied to video instance segmentation.
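For intuition, here is a minimal PyTorch-style sketch of similarity-guided temporal RoI feature aggregation as described in the abstract. This is an illustration under assumptions, not the authors' implementation: the function name `temporal_roi_align_sketch`, the use of cosine similarity, the top-k selection, and the plain averaging fusion are all illustrative stand-ins for the operator described above.

```python
import torch
import torch.nn.functional as F

def temporal_roi_align_sketch(cur_roi_feats, support_feat_maps, top_k=2):
    """Minimal sketch of similarity-guided temporal RoI feature extraction.

    cur_roi_feats:     (N, C, h, w) RoI features from the current frame
    support_feat_maps: (T, C, H, W) feature maps from other frames
    Returns:           (N, C, h, w) RoI features enriched with temporal info
    """
    N, C, h, w = cur_roi_feats.shape
    T, _, H, W = support_feat_maps.shape

    # Treat every RoI cell and every support-frame location as a C-dim vector.
    q = cur_roi_feats.permute(0, 2, 3, 1).reshape(N * h * w, C)
    k = support_feat_maps.permute(0, 2, 3, 1).reshape(T * H * W, C)

    # Cosine similarity between each RoI cell and all support locations.
    sim = F.normalize(q, dim=1) @ F.normalize(k, dim=1).t()  # (N*h*w, T*H*W)

    # For each RoI cell, gather the top-k most similar support features
    # and combine them with similarity-weighted averaging.
    topk_sim, topk_idx = sim.topk(top_k, dim=1)          # (N*h*w, top_k)
    gathered = k[topk_idx]                               # (N*h*w, top_k, C)
    weights = topk_sim.softmax(dim=1).unsqueeze(-1)      # (N*h*w, top_k, 1)
    temporal = (weights * gathered).sum(dim=1)           # (N*h*w, C)
    temporal = temporal.reshape(N, h, w, C).permute(0, 3, 1, 2)

    # Fuse current-frame and temporal RoI features; a plain average stands in
    # for whatever learned fusion the full operator would use.
    return 0.5 * (cur_roi_feats + temporal)

# Example: 4 proposals, 256 channels, 7x7 RoI grid, 8 support frames at 32x32.
rois = torch.randn(4, 256, 7, 7)
supports = torch.randn(8, 256, 32, 32)
out = temporal_roi_align_sketch(rois, supports)  # -> (4, 256, 7, 7)
```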
Related papers
- Agent-based Video Trimming [17.519404251018308]
We introduce a novel task called Video Trimming (VT).
VT focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story.
Our Agent-based Video Trimming (AVT) method received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task.
arXiv Detail & Related papers (2024-12-12T17:59:28Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects of a given category, defined by a few annotated support images, in a query video.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring [70.06559269075352]
We propose a video deblurring method that leverages both neighboring frames and existing sharp frames using hybrid Transformers for feature aggregation.
To aggregate nearest sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability.
Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z)
- Video Imprint [107.1365846180187]
A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting.
The proposed video imprint representation exploits temporal correlations among image features across video frames.
The video imprint is fed into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval tasks, respectively.
arXiv Detail & Related papers (2021-06-07T00:32:47Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)