A Generic Object Re-identification System for Short Videos
- URL: http://arxiv.org/abs/2102.05275v1
- Date: Wed, 10 Feb 2021 05:45:09 GMT
- Title: A Generic Object Re-identification System for Short Videos
- Authors: Tairu Qiu, Guanxian Chen, Zhongang Qi, Bin Li, Ying Shan, Xiangyang Xue
- Abstract summary: A Temporal Information Fusion Network (TIFN) is proposed in the object detection module.
A Cross-Layer Pointwise Siamese Network (CPSN) is proposed in the tracking module to enhance the robustness of the appearance model.
Two challenge datasets containing real-world short videos are built for video object trajectory extraction and generic object re-identification.
- Score: 39.662850217144964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short video applications like TikTok and Kwai have recently become
a great hit. To meet the increasing demand and take full advantage of the
visual information in short videos, the objects in each short video need to be
located and analyzed as an upstream task. A question is thus raised: how can
the accuracy and robustness of object detection, tracking, and
re-identification be improved across massive numbers of short videos covering
hundreds of categories and complicated visual effects (VFX)? To this end, this
paper proposes a system composed of a detection module, a tracking module, and
a generic object re-identification module, which captures the features of the
major objects in short videos. In particular, to meet the high-efficiency
demands of practical short video applications, a Temporal Information Fusion
Network (TIFN) is proposed in the object detection module; it achieves
accuracy comparable to state-of-the-art video object detectors with improved
time efficiency. Furthermore, to mitigate the tracklet fragmentation issue in
short videos, a Cross-Layer Pointwise Siamese Network (CPSN) is proposed in
the tracking module to enhance the robustness of the appearance model.
Moreover, to evaluate the proposed system, two challenge datasets containing
real-world short videos are built for video object trajectory extraction and
generic object re-identification, respectively. Overall, extensive experiments
on each module and the whole system demonstrate the effectiveness and
efficiency of our system.
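The abstract names the TIFN but gives no architectural detail here. As a rough
illustration only, the following is a minimal sketch of one common way to fuse
temporal information: concatenating backbone features from a short window of
frames and mixing them with a 1x1 convolution. The window size, fusion
operator, and tensor shapes are assumptions, not the paper's design.

```python
# A minimal temporal-feature-fusion sketch (assumed design, not the TIFN).
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Fuses per-frame backbone features from a short temporal window."""

    def __init__(self, channels: int, window: int = 3):
        super().__init__()
        # A 1x1 conv mixes the concatenated window back down to `channels`.
        self.mix = nn.Conv2d(channels * window, channels, kernel_size=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: per-frame (N, C, H, W) maps, e.g. for frames t-1, t, t+1.
        return self.mix(torch.cat(feats, dim=1))

fusion = TemporalFusion(channels=256)
frames = [torch.randn(1, 256, 32, 32) for _ in range(3)]
fused = fusion(frames)  # (1, 256, 32, 32) temporally enriched feature map
```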
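Likewise, the CPSN is described here only as a pointwise Siamese appearance
model. Below is a sketch of pointwise (1x1) cross-correlation, where every
template location acts as its own 1x1 kernel over the search-region features;
treat it as an illustration of the operation at a single layer, not the
paper's cross-layer network.

```python
# Pointwise Siamese matching sketch (assumed form, not the CPSN itself).
import torch
import torch.nn.functional as F

def pointwise_xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Per-location 1x1 cross-correlation of template vs. search features.

    template: (N, C, h, w) exemplar features; search: (N, C, H, W).
    Returns an (N, h*w, H, W) similarity volume.
    """
    n, c, h, w = template.shape
    # Each template location becomes its own 1x1 convolution kernel.
    kernels = template.permute(0, 2, 3, 1).reshape(n * h * w, c, 1, 1)
    sim = F.conv2d(search.reshape(1, n * c, *search.shape[-2:]), kernels, groups=n)
    return sim.reshape(n, h * w, *search.shape[-2:])

t = torch.randn(2, 256, 7, 7)    # template (tracked object) features
s = torch.randn(2, 256, 31, 31)  # search-region features
print(pointwise_xcorr(t, s).shape)  # torch.Size([2, 49, 31, 31])
```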
Related papers
- PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding [1.2781698000674653]
PerspectiveNet is a lightweight model for generating long descriptions across multiple camera views.
Our approach utilizes a vision encoder, a compact connector module, and large language models.
The resulting model is lightweight, ensuring efficient training and inference, while remaining highly effective.
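The summary names a vision encoder, a compact connector module, and a large
language model, but none of their internals. The sketch below illustrates only
the connector stage of that three-part pattern; the two-layer MLP and the
dimensions are assumptions.

```python
# Hypothetical vision-to-LLM connector (assumed design, not PerspectiveNet's).
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Projects vision tokens into an LLM's embedding space."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, views * tokens, vision_dim) -> (batch, views * tokens, llm_dim)
        return self.proj(vision_tokens)
```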
arXiv Detail & Related papers (2024-10-22T08:57:17Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
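A common training-free criterion for filtering unstable tracked points is
forward-backward consistency; the sketch below assumes that criterion, which
may differ from the paper's actual filter.

```python
# Forward-backward consistency check for tracked points (assumed criterion).
import numpy as np

def stable_mask(orig: np.ndarray, roundtrip: np.ndarray, tol: float = 2.0) -> np.ndarray:
    """orig: (K, 2) points in frame t; roundtrip: the same points tracked
    t -> t+1 -> t. Points drifting more than `tol` pixels are unstable."""
    return np.linalg.norm(orig - roundtrip, axis=1) < tol

pts = np.array([[10.0, 12.0], [40.0, 8.0]])
back = np.array([[10.5, 12.2], [47.0, 3.0]])  # the second point drifted
print(stable_mask(pts, back))  # [ True False]
```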
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce human annotation to a single point for each object in a video frame during training, and obtain high-quality mask predictions close to those of fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Uncertainty Aware Active Learning for Reconfiguration of Pre-trained Deep Object-Detection Networks for New Target Domains [0.0]
Object detection is one of the most important and fundamental computer vision tasks.
To obtain training data for object detection models efficiently, many datasets collect their unannotated data in video format.
Annotating every frame of a video is costly and inefficient, since many frames contain very similar information for the model to learn from.
In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem.
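A typical acquisition rule for this setting is to rank video frames by
predictive uncertainty and send only the most uncertain ones for annotation.
The sketch below uses mean detection entropy as an assumed score; the paper's
actual criterion may differ.

```python
# Uncertainty-based frame selection sketch (assumed acquisition score).
import numpy as np

def frame_uncertainty(class_probs: list[np.ndarray]) -> float:
    """Mean predictive entropy over a frame's detections."""
    ents = [-(p * np.log(p + 1e-12)).sum() for p in class_probs]
    return float(np.mean(ents)) if ents else 0.0

def select_frames(frames: dict[int, list[np.ndarray]], budget: int) -> list[int]:
    """Return the `budget` frame ids with the highest uncertainty."""
    ranked = sorted(frames, key=lambda f: frame_uncertainty(frames[f]), reverse=True)
    return ranked[:budget]

# Frame 1's detection is far less confident, so it is selected first.
frames = {0: [np.array([0.9, 0.05, 0.05])], 1: [np.array([0.4, 0.3, 0.3])]}
print(select_frames(frames, budget=1))  # [1]
```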
arXiv Detail & Related papers (2023-03-22T17:14:10Z)
- A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel framework named CEVAS for efficient traffic-related object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z)
- Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection [12.650574326251023]
We present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information.
Our method significantly improves single-frame Sparse R-CNN by 8%-9% in mAP.
arXiv Detail & Related papers (2022-10-05T16:17:55Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
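The co-attention formulation is not spelled out in this summary. One plausible
form, sketched below, derives a channel gate from each feature stream and
applies it to the other; the gating design, and the assumption that both
streams share channel count and spatial size, are illustrative only.

```python
# A co-attention sketch: each stream gates the other (assumed formulation).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Global-pooled channel gates, one computed from each stream.
        self.gate_low = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )
        self.gate_high = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed to be (N, C, H, W) with matching shapes;
        # each stream is re-weighted by a gate computed from the other.
        return low * self.gate_high(high) + high * self.gate_low(low)

combine = CoAttention(channels=64)
out = combine(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```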
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Video Super-resolution with Temporal Group Attention [127.21615040695941]
We propose a novel method that can effectively incorporate temporal information in a hierarchical way.
The input sequence is divided into several groups, with each group corresponding to a different frame rate.
It achieves favorable performance against state-of-the-art methods on several benchmark datasets.
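Grouping an input sequence by frame rate amounts to binning neighbor frames by
their temporal stride from the reference frame. The small sketch below
illustrates that grouping step only; the strides and window size are
assumptions.

```python
# Grouping neighbor frames by temporal stride (assumed grouping rule).
def group_by_stride(num_frames: int, ref: int, strides=(1, 2, 3)) -> dict[int, list[int]]:
    """Split the neighbors of frame `ref` into groups of equal stride."""
    groups = {}
    for s in strides:
        candidates = (ref - s, ref + s)
        groups[s] = [i for i in candidates if 0 <= i < num_frames]
    return groups

# For a 7-frame window centered on frame 3:
print(group_by_stride(7, ref=3))  # {1: [2, 4], 2: [1, 5], 3: [0, 6]}
```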
arXiv Detail & Related papers (2020-07-21T04:54:30Z)