Visual Object Tracking across Diverse Data Modalities: A Review
- URL: http://arxiv.org/abs/2412.09991v1
- Date: Fri, 13 Dec 2024 09:25:18 GMT
- Title: Visual Object Tracking across Diverse Data Modalities: A Review
- Authors: Mengmeng Wang, Teli Ma, Shuo Xin, Xiaojun Hou, Jiazheng Xing, Guang Dai, Jingdong Wang, Yong Liu,
- Abstract summary: Visual Object Tracking (VOT) is an attractive and significant research area in computer vision.<n>We first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking.<n>Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language.
- Score: 33.006051781123595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. The VOT technology could be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared and point cloud. Besides, since no one sensor could handle all the dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of the recent progress of both single-modal and multi-modal VOT, especially the deep learning methods. Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking. In particular, we conclude four widely-used single-modal frameworks, abstracting their schemas and categorizing the existing inheritors. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language. Moreover, the comparison results in plenty of VOT benchmarks of the discussed modalities are presented. Finally, we provide recommendations and insightful observations, inspiring the future development of this fast-growing literature.
Related papers
- Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection [1.5236380958983644]
Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores.
Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth map.
We propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues.
arXiv Detail & Related papers (2025-01-18T08:09:44Z) - Awesome Multi-modal Object Tracking [41.76977058932557]
Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities to estimate the state of an arbitrary object in a video sequence.
To track the latest progress in MMOT, we conduct a comprehensive investigation in this report.
arXiv Detail & Related papers (2024-05-23T05:58:10Z) - Salient Object Detection in RGB-D Videos [11.805682025734551]
This paper makes two primary contributions: the dataset and the model.
We construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth.
We introduce DCTNet+, a three-stream network tailored for RGB-D VSOD.
arXiv Detail & Related papers (2023-10-24T03:18:07Z) - A Multi-modal Approach to Single-modal Visual Place Classification [2.580765958706854]
Multi-sensor fusion approaches combining RGB and depth (D) have gained popularity in recent years.
We reformulate the single-modal RGB image classification task as a pseudo multi-modal RGB-D classification problem.
A practical, fully self-supervised framework for training, appropriately processing, fusing, and classifying these two modalities is described.
arXiv Detail & Related papers (2023-05-10T14:04:21Z) - Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z) - Learning Dual-Fused Modality-Aware Representations for RGBD Tracking [67.14537242378988]
Compared with the traditional RGB object tracking, the addition of the depth modality can effectively solve the target and background interference.
Some existing RGBD trackers use the two modalities separately and thus some particularly useful shared information between them is ignored.
We propose a novel Dual-fused Modality-aware Tracker (termed DMTracker) which aims to learn informative and discriminative representations of the target objects for robust RGBD tracking.
arXiv Detail & Related papers (2022-11-06T07:59:07Z) - Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline [80.13652104204691]
In this paper, we construct a large-scale benchmark with high diversity for visible-thermal UAV tracking (VTUAV)
We provide a coarse-to-fine attribute annotation, where frame-level attributes are provided to exploit the potential of challenge-specific trackers.
In addition, we design a new RGB-T baseline, named Hierarchical Multi-modal Fusion Tracker (HMFT), which fuses RGB-T data in various levels.
arXiv Detail & Related papers (2022-04-08T15:22:33Z) - Multi-modal Visual Tracking: Review and Experimental Comparison [85.20414397784937]
We summarize the multi-modal tracking algorithms, especially visible-depth (RGB-D) tracking and visible-thermal (RGB-T) tracking.
We conduct experiments to analyze the effectiveness of trackers on five datasets.
arXiv Detail & Related papers (2020-12-08T02:39:38Z) - Bifurcated backbone strategy for RGB-D salient object detection [168.19708737906618]
We leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network.
Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent.
arXiv Detail & Related papers (2020-07-06T13:01:30Z) - Is Depth Really Necessary for Salient Object Detection? [50.10888549190576]
We make the first attempt in realizing an unified depth-aware framework with only RGB information as input for inference.
Not only surpasses the state-of-the-art performances on five public RGB SOD benchmarks, but also surpasses the RGBD-based methods on five benchmarks by a large margin.
arXiv Detail & Related papers (2020-05-30T13:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.