CiteTracker: Correlating Image and Text for Visual Tracking
- URL: http://arxiv.org/abs/2308.11322v1
- Date: Tue, 22 Aug 2023 09:53:12 GMT
- Title: CiteTracker: Correlating Image and Text for Visual Tracking
- Authors: Xin Li, Yuqing Huang, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan
Yang
- Abstract summary: We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference.
- Score: 114.48653709286629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing visual tracking methods typically take an image patch as the
reference of the target to perform tracking. However, a single image patch
cannot provide a complete and precise concept of the target object as images
are limited in their ability to abstract and can be ambiguous, which makes it
difficult to track targets with drastic variations. In this paper, we propose
the CiteTracker to enhance target modeling and inference in visual tracking by
connecting images and text. Specifically, we develop a text generation module
to convert the target image patch into a descriptive text containing its class
and attribute information, providing a comprehensive reference point for the
target. In addition, a dynamic description module is designed to adapt to
target variations for more effective target representation. We then associate
the target description and the search image using an attention-based
correlation module to generate the correlated features for target state
reference. Extensive experiments on five diverse datasets are conducted to
evaluate the proposed algorithm, and its favorable performance against
state-of-the-art methods demonstrates the effectiveness of the proposed
tracking approach.
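The abstract reads directly as a data flow: target patch, then descriptive text (class and attributes), then attention-based correlation with the search image. Below is a minimal, self-contained PyTorch sketch of that flow written under stated assumptions, not the authors' implementation: a CLIP-like similarity over a toy attribute vocabulary stands in for the text generation and dynamic description modules, and a single cross-attention layer stands in for the correlation module. All module names, dimensions, and the vocabulary size are illustrative.

```python
# Minimal sketch of a CiteTracker-style pipeline (assumptions, not the
# authors' code): a CLIP-like similarity score stands in for the text
# generation module, and one cross-attention layer stands in for the
# attention-based correlation module. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding width (assumed)

class TextGeneration(nn.Module):
    """Scores a small attribute vocabulary against the target patch and
    returns embeddings of the top-k matching words. Re-scoring per frame
    would approximate the paper's dynamic description behavior."""
    def __init__(self, vocab_size=8, dim=D):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, dim, 7, 4), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.word_emb = nn.Embedding(vocab_size, dim)  # stand-in text encoder

    def forward(self, patch, k=3):
        img = F.normalize(self.image_enc(patch), dim=-1)    # (B, D)
        words = F.normalize(self.word_emb.weight, dim=-1)   # (V, D)
        topk = (img @ words.t()).topk(k, dim=-1).indices    # CLIP-style match
        return self.word_emb(topk)                          # (B, k, D)

class CorrelationModule(nn.Module):
    """Attention-based correlation: search-image tokens attend to the
    description tokens, yielding features for target state inference."""
    def __init__(self, dim=D):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, search_tokens, text_tokens):
        out, _ = self.attn(search_tokens, text_tokens, text_tokens)
        return search_tokens + out  # residual fusion (assumed)

if __name__ == "__main__":
    patch = torch.randn(2, 3, 128, 128)   # template patch around the target
    search = torch.randn(2, 3, 256, 256)  # search image
    backbone = nn.Sequential(nn.Conv2d(3, D, 7, 8), nn.Flatten(2))  # toy encoder
    search_tokens = backbone(search).transpose(1, 2)  # (B, 1024, D)
    fused = CorrelationModule()(search_tokens, TextGeneration()(patch))
    print(fused.shape)  # torch.Size([2, 1024, 256])
```

The sketch only captures the data flow; the paper's actual text side builds separate class and attribute descriptions rather than a flat word list.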
Related papers
- Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models [4.157013247909771]
We propose to leverage recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer).
We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments.
We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
arXiv Detail & Related papers (2023-11-17T21:58:26Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, the text embedding is used to query the visual features and localize the corresponding target (see the sketch after this list).
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z) - Improving Visual Grounding with Visual-Linguistic Verification and
Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z) - Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method for complex scene image editing by Scene Graph Comprehension (SGC-Net).
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object
Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the strong segmentation-based trackers D3S and SiamMask on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - Visual Tracking by TridentAlign and Context Embedding [71.60159881028432]
We propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods.
The performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed.
arXiv Detail & Related papers (2020-07-14T08:00:26Z)
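Several of the grounding and referring entries above rely on the same building block: a text embedding queries visual features to localize the described target (the text-to-image decoder in DMMI, the visual-linguistic verification module, and the language-based tracking formulations). A minimal, hedged sketch of that pattern, with hypothetical names and shapes:

```python
# Illustrative sketch of the shared pattern (not taken from any listed
# paper): a text embedding queries visual features to produce a response
# map over spatial locations. Function name and shapes are hypothetical.
import torch
import torch.nn.functional as F

def text_to_image_response(text_emb, visual_feat):
    """text_emb: (B, D); visual_feat: (B, D, H, W) -> (B, H, W) map of
    how well each location matches the description (cosine similarity)."""
    B, D, H, W = visual_feat.shape
    v = F.normalize(visual_feat.flatten(2), dim=1)   # (B, D, H*W)
    t = F.normalize(text_emb, dim=-1).unsqueeze(1)   # (B, 1, D)
    return torch.bmm(t, v).view(B, H, W)

resp = text_to_image_response(torch.randn(2, 256), torch.randn(2, 256, 16, 16))
print(resp.shape)  # torch.Size([2, 16, 16])
```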