Joint Visual Grounding and Tracking with Natural Language Specification
- URL: http://arxiv.org/abs/2303.12027v1
- Date: Tue, 21 Mar 2023 17:09:03 GMT
- Title: Joint Visual Grounding and Tracking with Natural Language Specification
- Authors: Li Zhou, Zikun Zhou, Kaige Mao, Zhenyu He
- Abstract summary: Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
- Score: 6.695284124073918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tracking by natural language specification aims to locate the referred target
in a sequence based on the natural language description. Existing algorithms
solve this problem in two steps, visual grounding and tracking, and accordingly
deploy a separate grounding model and a separate tracking model to implement
these two steps, respectively. Such a separated framework overlooks the link
between visual grounding and tracking: the natural language description
provides global semantic cues for localizing the target in both steps.
Moreover, the separated framework can hardly be trained end-to-end. To handle
these issues, we propose a joint visual grounding and tracking framework, which
reformulates grounding and tracking as a unified task: localizing the referred
target based on the given visual-language references. Specifically, we propose
a multi-source relation modeling module to effectively build the relation
between the visual-language references and the test image. In addition, we
design a temporal modeling module to provide a temporal clue with the guidance
of the global semantic information for our model, which effectively improves
the adaptability to the appearance variations of the target. Extensive
experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our
method performs favorably against state-of-the-art algorithms for both tracking
and grounding. Code is available at https://github.com/lizhou-cs/JointNLT.
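To make the unified formulation concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a multi-source relation modeling block that jointly attends over language tokens, template tokens, and test-image tokens, plus a language-gated temporal memory that injects a historical cue. All class names, dimensions, and the simple box head are illustrative assumptions for exposition; they do not reproduce the actual JointNLT architecture (see the linked repository for the authors' code).

```python
# Illustrative sketch only: module names, dimensions, and the box head are
# assumptions, not the actual JointNLT implementation.
import torch
import torch.nn as nn


class MultiSourceRelationModeling(nn.Module):
    """Jointly relates language tokens, template tokens, and test-image tokens."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, lang_tok, templ_tok, test_tok, temporal_tok=None):
        # Concatenate all reference sources with the test-image tokens so the
        # encoder can model every pairwise relation in a single pass.
        sources = [lang_tok, templ_tok, test_tok]
        if temporal_tok is not None:
            sources.insert(2, temporal_tok)
        fused = self.encoder(torch.cat(sources, dim=1))
        # Keep only the test-image part, which is used for box prediction.
        return fused[:, -test_tok.shape[1]:]


class TemporalModule(nn.Module):
    """Keeps a language-gated summary of previous target appearances."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, prev_target_tok, lang_global):
        # prev_target_tok: (B, T, dim) tokens from past frames;
        # lang_global: (B, dim) pooled sentence feature (global semantic cue).
        g = self.gate(torch.cat(
            [prev_target_tok,
             lang_global.unsqueeze(1).expand_as(prev_target_tok)], dim=-1))
        return g * prev_target_tok  # down-weight history unrelated to the text


class JointGroundingTracking(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.relation = MultiSourceRelationModeling(dim)
        self.temporal = TemporalModule(dim)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), a simplification

    def forward(self, lang_tok, templ_tok, test_tok,
                prev_target_tok=None, lang_global=None):
        temporal_tok = None
        if prev_target_tok is not None:
            temporal_tok = self.temporal(prev_target_tok, lang_global)
        test_feat = self.relation(lang_tok, templ_tok, test_tok, temporal_tok)
        return self.box_head(test_feat.mean(dim=1))  # one box per test image


if __name__ == "__main__":
    B, dim = 2, 256
    lang = torch.randn(B, 20, dim)    # embedded language description
    templ = torch.randn(B, 49, dim)   # embedded template patches
    test = torch.randn(B, 196, dim)   # embedded test-image patches
    model = JointGroundingTracking(dim)
    print(model(lang, templ, test).shape)  # torch.Size([2, 4])
```

In this reading, grounding and tracking differ only in which references are informative (language alone in the first frame, language plus template afterwards), so a single relation-modeling model can serve both steps.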
Related papers
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with a prompt module that leverages the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues.
This design ensures temporal consistency by leveraging historical visual information and provides an integrated solution, generating predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - Expand BERT Representation with Visual Information via Grounded Language
Learning with Multimodal Partial Alignment [11.148099070407431]
GroundedBERT is a grounded language learning method that enhances the BERT representation with visually grounded information.
Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
arXiv Detail & Related papers (2023-12-04T03:16:48Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z) - CiteTracker: Correlating Image and Text for Visual Tracking [114.48653709286629]
We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference.
arXiv Detail & Related papers (2023-08-22T09:53:12Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation [33.03600813115465]
We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
arXiv Detail & Related papers (2022-12-03T07:57:31Z) - Learning Point-Language Hierarchical Alignment for 3D Visual Grounding [35.17185775314988]
This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner.
We extract key points and proposal points to model 3D contexts and instances, and propose point-language alignment with context modulation.
To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme.
arXiv Detail & Related papers (2022-10-22T18:02:10Z) - Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z) - Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships, we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)