Related papers: Referring Multi-Object Tracking

URL: http://arxiv.org/abs/2303.03366v1
Date: Mon, 6 Mar 2023 18:50:06 GMT
Title: Referring Multi-Object Tracking
Authors: Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, Jianbing Shen
Abstract summary: We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
Score: 78.63827591797124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts.

Related papers

ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking [17.736434513456576]
ReferGPT is a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries.
arXiv Detail & Related papers (2025-04-12T12:33:15Z)
Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation [28.16053631036079]
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects in a video. We introduce a compact Transformer-based method, termed TenRMOT, to exploit the advantages of Transformer architecture. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.
arXiv Detail & Related papers (2024-10-17T11:07:05Z)
VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities. We introduce VISA (Video-based large language Instructed Segmentation Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
Bootstrapping Referring Multi-Object Tracking [14.46285727127232]
Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects following human instruction represented by a natural language expression. Our key idea is to bootstrap the task of referring multi-object tracking by introducing discriminative language words.
arXiv Detail & Related papers (2024-06-07T16:02:10Z)
OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose a new task, 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing with rich captions objects that have never been seen before. We develop an object abstractor and an object-to-text abstractor.
arXiv Detail & Related papers (2024-04-04T17:59:58Z)
DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is the challenging task of processing real-world scene point clouds. The recent state-of-the-art method first segments each object and then processes the objects independently, using multiple stages for the different sub-tasks. We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
Type-to-Track: Retrieve Any Object via Prompt-based Tracking [34.859061177766016]
This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track. Type-to-Track allows users to track objects in videos by typing natural language descriptions. We present a new dataset for that Grounded Multiple Object Tracking task, called GroOT.
arXiv Detail & Related papers (2023-05-22T21:25:27Z)
Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm. It can flexibly perceive different types of objects by simply changing the input prompts. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z)
End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about the location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches to class-agnostic tracking that also performs well for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video. There is little interaction between them due to the use of disparate benchmark datasets and metrics. We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics. We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention. We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.