Robust Object Modeling for Visual Tracking
- URL: http://arxiv.org/abs/2308.05140v1
- Date: Wed, 9 Aug 2023 15:32:03 GMT
- Title: Robust Object Modeling for Visual Tracking
- Authors: Yidong Cai, Jie Liu, Jie Tang, Gangshan Wu
- Abstract summary: We propose a robust object modeling framework for visual tracking (ROMTrack)
ROMTrack simultaneously models the inherent template and the hybrid template features.
Variation tokens are adaptable to object deformation and appearance variations, which can boost overall performance with negligible computation.
- Score: 36.05869157990915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object modeling has become a core part of recent tracking frameworks. Current
popular trackers use Transformer attention to extract the template feature
separately or interactively with the search region. However, separate template
learning lacks communication between the template and search regions, which
makes it difficult to extract discriminative target-oriented features. On the
other hand, interactive template learning produces hybrid template features,
which may introduce potential distractors to the template via the cluttered
search regions. To enjoy the merits of both methods, we propose a robust object
modeling framework for visual tracking (ROMTrack), which simultaneously models
the inherent template and the hybrid template features. As a result, harmful
distractors can be suppressed by combining the inherent features of target
objects with search regions' guidance. Target-related features can also be
extracted using the hybrid template, thus resulting in a more robust object
modeling framework. To further enhance robustness, we present novel variation
tokens to depict the ever-changing appearance of target objects. Variation
tokens are adaptable to object deformation and appearance variations, which can
boost overall performance with negligible computation. Experiments show that
our ROMTrack sets a new state-of-the-art on multiple benchmarks.
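To make the dual-template idea concrete, the sketch below shows one way the inherent and hybrid features could be computed side by side. It is a minimal PyTorch illustration, not the authors' implementation: the module layout, token counts, and use of plain multi-head attention are all assumptions.

```python
# Sketch of ROMTrack-style dual-template attention. All names, shapes, and
# the plain nn.MultiheadAttention blocks are assumptions for illustration.
import torch
import torch.nn as nn

class DualTemplateAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Inherent branch: template tokens attend only to themselves, so
        # clutter from the search region cannot contaminate them.
        self.inherent_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hybrid branch: template, variation, and search tokens attend
        # jointly, yielding target-oriented features with search guidance.
        self.hybrid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, template, search, variation):
        # template:  (B, Nt, C) tokens from the target crop
        # search:    (B, Ns, C) tokens from the current search region
        # variation: (B, Nv, C) tokens tracking appearance change over time
        inherent, _ = self.inherent_attn(template, template, template)
        mixed = torch.cat([template, variation, search], dim=1)
        hybrid, _ = self.hybrid_attn(mixed, mixed, mixed)
        return inherent, hybrid

# Dummy usage: 64 template, 256 search, and 4 variation tokens.
layer = DualTemplateAttention(dim=256)
inh, hyb = layer(torch.randn(2, 64, 256), torch.randn(2, 256, 256),
                 torch.randn(2, 4, 256))
print(inh.shape, hyb.shape)  # (2, 64, 256) and (2, 324, 256)
```

Returning both views reflects the trade-off the abstract describes: the inherent branch never sees the search region and so cannot absorb distractors, while the hybrid branch gains search-region guidance at the risk of clutter.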
Related papers
- A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z)
- Learning from Exemplars for Interactive Image Segmentation [15.37506525730218]
We introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category.
Our model reduces users' labor by around 15%, requiring two fewer clicks to achieve target IoUs of 85% and 90%.
arXiv Detail & Related papers (2024-06-17T12:38:01Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models [4.157013247909771]
We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer).
We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments.
We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
arXiv Detail & Related papers (2023-11-17T21:58:26Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Generative Target Update for Adaptive Siamese Tracking [7.662745552551165]
Siamese trackers perform similarity matching with templates (i.e., target models) to localize objects within a search region (a minimal sketch of this matching step appears after this list).
Several strategies have been proposed in the literature to update a template based on the tracker output, typically extracted from the target search region in the current frame.
This paper proposes a model adaptation method for Siamese trackers that uses a generative model to produce a synthetic template from the object search regions of several previous frames.
arXiv Detail & Related papers (2022-02-21T00:22:49Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- Attention-based Joint Detection of Object and Semantic Part [4.389917490809522]
Our model is created on top of two Faster-RCNN models that share their features to get enhanced representations of both.
Experiments on the PASCAL-Part 2010 dataset show that joint detection can simultaneously improve both object detection and part detection.
arXiv Detail & Related papers (2020-07-05T18:54:10Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images (the underlying FID statistic is written out after this list).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
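As noted in the Generative Target Update entry, Siamese trackers localize the target by sliding a template embedding over a search-region embedding and taking the peak of the resulting response map. The sketch below shows that matching step with assumed feature shapes and no learned backbone; it is generic SiamFC-style matching, not any specific paper's code.

```python
# Generic Siamese similarity matching: cross-correlate a template feature
# map against a search feature map. Shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    # template_feat: (C, Ht, Wt) embedding of the target template
    # search_feat:   (C, Hs, Ws) embedding of the search region
    # conv2d with the template as the kernel implements cross-correlation,
    # giving a (Hs-Ht+1, Ws-Wt+1) response map over the search region.
    response = F.conv2d(search_feat.unsqueeze(0),    # (1, C, Hs, Ws)
                        template_feat.unsqueeze(0))  # (1, C, Ht, Wt)
    return response[0, 0]

# Dummy usage: an 8x8 template against a 22x22 search region.
resp = siamese_response(torch.randn(256, 8, 8), torch.randn(256, 22, 22))
row, col = divmod(resp.argmax().item(), resp.shape[1])
print(resp.shape, (row, col))  # torch.Size([15, 15]) and the peak location
```

Template-update methods such as the generative one summarized above then replace or blend `template_feat` over time so the matching stays accurate as the target's appearance changes.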
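For the SceneFID entry, the underlying Fréchet Inception Distance compares the Gaussian statistics (mean and covariance of Inception features) of real (r) and generated (g) images; that SceneFID applies this same statistic at the object level is my reading of the entry, not a detail it states.

```latex
% Standard Frechet Inception Distance between real (r) and generated (g)
% feature distributions, each modeled as a Gaussian over Inception activations.
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```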