Bridging Images and Videos: A Simple Learning Framework for Large
Vocabulary Video Object Detection
- URL: http://arxiv.org/abs/2212.10147v1
- Date: Tue, 20 Dec 2022 10:33:03 GMT
- Title: Bridging Images and Videos: A Simple Learning Framework for Large
Vocabulary Video Object Detection
- Authors: Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young
Lee
- Abstract summary: We present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking.
We show that various large vocabulary trackers can be consistently improved, setting strong baseline results on the challenging TAO benchmarks.
- Score: 110.08925274049409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling object taxonomies is one of the important steps toward robust
real-world deployment of recognition systems. Since the introduction of the LVIS
benchmark, we have seen remarkable progress on images. To continue this success
in videos, a new video benchmark, TAO, was recently presented. Given the recent
encouraging results from both the detection and tracking communities, we are
interested in marrying those two advances and building a strong large vocabulary
video tracker. However, supervision in LVIS and TAO is inherently sparse or even
missing, posing two new challenges for training large vocabulary trackers.
First, LVIS contains no tracking supervision, which leads to inconsistent
learning of detection (with LVIS and TAO) and tracking (only with TAO). Second,
the detection supervision in TAO is partial, which results in catastrophic
forgetting of absent LVIS categories during video fine-tuning. To resolve these
challenges, we present a simple but effective learning framework that takes full
advantage of all available training data to learn detection and tracking without
losing the ability to recognize any LVIS category. With this new learning
scheme, we show that various large vocabulary trackers can be consistently
improved, setting strong baseline results on the challenging TAO benchmarks.
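The abstract does not spell out the framework, but the second challenge points to one concrete ingredient: when a video dataset annotates only part of the label space, the classification loss can be restricted to the annotated categories so that absent LVIS classes are never suppressed. The sketch below illustrates that general idea only; the function, shapes, and class indices are our assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): mask the classification loss to
# the categories actually annotated in a partially labeled video batch,
# so unannotated LVIS classes receive no suppressing gradient.
import torch
import torch.nn.functional as F

def masked_classification_loss(logits, targets, annotated):
    # Classes outside this dataset's label space get -inf logits: they
    # contribute zero probability to the softmax and zero gradient.
    mask = torch.full((logits.size(-1),), float("-inf"))
    mask[annotated] = 0.0
    return F.cross_entropy(logits + mask, targets)

# Example: a 1,200-way classifier, but only classes {3, 57, 901} are
# annotated in the current TAO-style batch (indices are made up).
annotated = torch.tensor([3, 57, 901])
logits = torch.randn(8, 1200)
targets = annotated[torch.randint(0, 3, (8,))]
loss = masked_classification_loss(logits, targets, annotated)
```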
Related papers
- COOLer: Class-Incremental Learning for Appearance-Based Multiple Object
Tracking [32.47215340215641]
This paper extends the scope of continual learning research to class-incremental learning for multiple object tracking (MOT).
Previous solutions for continual learning of object detectors do not address the data association stage of appearance-based trackers.
We introduce COOLer, a COntrastive- and cOntinual-Learning-based tracker, which incrementally learns to track new categories while preserving past knowledge.
arXiv Detail & Related papers (2023-10-04T17:49:48Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer); a rough sketch of such a mixer appears after this list.
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- DEFT: Detection Embeddings for Tracking [3.326320568999945]
We propose an efficient joint detection and tracking model named DEFT.
Our approach relies on an appearance-based object matching network jointly learned with an underlying object detection network (a toy matching sketch appears after this list).
DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards.
arXiv Detail & Related papers (2021-02-03T20:00:44Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
The motivation of our unsupervised learning is that a robust tracker should be effective in bidirectional tracking (a toy consistency-loss sketch appears after this list).
We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
- TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model [51.14840210957289]
Multi-object tracking is a fundamental vision problem that has been studied for a long time.
Despite the success of Tracking by Detection (TBD), this two-step method is too complicated to train in an end-to-end manner.
We propose a concise end-to-end model, TubeTK, which needs only one-step training by introducing the "bounding-tube" to indicate the temporal-spatial locations of objects in a short video clip (a minimal tube illustration appears after this list).
arXiv Detail & Related papers (2020-06-10T06:45:05Z)
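The vision-language tracking entry above names a modality mixer (ModaMixer) without giving details. One plausible shape for such a module, in which a language embedding gates visual channels, is sketched below; the class and its interface are hypothetical, not the paper's ModaMixer.

```python
# Hypothetical modality mixer: language features act as a channel-wise
# selector over visual features (our guess, not the paper's design).
import torch
import torch.nn as nn

class SimpleModalityMixer(nn.Module):
    def __init__(self, vis_channels: int, lang_dim: int):
        super().__init__()
        # Project the language embedding to one gate per visual channel.
        self.gate = nn.Sequential(nn.Linear(lang_dim, vis_channels),
                                  nn.Sigmoid())

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor):
        # vis_feat: (B, C, H, W); lang_feat: (B, D)
        g = self.gate(lang_feat)[:, :, None, None]  # (B, C, 1, 1)
        return vis_feat * g                         # channel-wise reweighting
```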
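For the DEFT entry, appearance-based matching can be illustrated by pairing detections across frames using embedding similarity and the Hungarian algorithm. The routine below is our toy illustration, not DEFT's matching network.

```python
# Toy appearance matching: cosine distance between embeddings, solved as
# a linear assignment problem (illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_by_appearance(emb_prev, emb_curr):
    """emb_prev: (M, D), emb_curr: (N, D) L2-normalized embeddings.
    Returns index arrays (prev_idx, curr_idx) of matched detections."""
    cost = 1.0 - emb_prev @ emb_curr.T  # cosine distance matrix
    return linear_sum_assignment(cost)

# Example with 3 previous tracks and 4 current detections.
rng = np.random.default_rng(0)
emb_prev = rng.normal(size=(3, 128))
emb_prev /= np.linalg.norm(emb_prev, axis=1, keepdims=True)
emb_curr = rng.normal(size=(4, 128))
emb_curr /= np.linalg.norm(emb_curr, axis=1, keepdims=True)
prev_idx, curr_idx = match_by_appearance(emb_prev, emb_curr)
```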
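For the unsupervised tracking entry, the bidirectional intuition can be written as a forward-then-backward round trip that should return to the starting box. This toy loss conveys the idea only; `tracker` is an assumed callable, and the paper's actual formulation (a Siamese correlation filter with multi-frame validation and a cost-sensitive loss) is more involved.

```python
# Toy cycle-consistency loss for bidirectional tracking (illustrative).
# `tracker(frame_a, frame_b, box)` is an assumed callable that propagates
# a box tensor from frame_a to frame_b.
import torch.nn.functional as F

def cycle_consistency_loss(tracker, frames, init_box):
    box = init_box
    for a, b in zip(frames[:-1], frames[1:]):              # track forward
        box = tracker(a, b, box)
    for a, b in zip(reversed(frames[1:]), reversed(frames[:-1])):
        box = tracker(a, b, box)                           # track backward
    # A robust tracker returns near the starting box after the round trip.
    return F.mse_loss(box, init_box)
```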
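Finally, for TubeTK, a bounding-tube can be pictured as a box extended along time. The dataclass below interpolates a box inside a tube's time span; it is an illustration of the concept, not TubeTK's actual tube parameterization.

```python
# Illustrative bounding-tube: a box linearly interpolated between its
# start and end frames (not TubeTK's real parameterization).
from dataclasses import dataclass

@dataclass
class BoundingTube:
    t_start: int
    t_end: int
    box_start: tuple  # (x1, y1, x2, y2) at t_start
    box_end: tuple    # (x1, y1, x2, y2) at t_end

    def box_at(self, t: int) -> tuple:
        # Linearly interpolate the box inside the tube's time span.
        assert self.t_start <= t <= self.t_end
        if self.t_end == self.t_start:
            return self.box_start
        w = (t - self.t_start) / (self.t_end - self.t_start)
        return tuple((1 - w) * a + w * b
                     for a, b in zip(self.box_start, self.box_end))

# Example: a tube spanning frames 0..10; the box halfway through.
tube = BoundingTube(0, 10, (0, 0, 10, 10), (20, 20, 30, 30))
mid_box = tube.box_at(5)  # (10.0, 10.0, 20.0, 20.0)
```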