Related papers: Omni Survey for Multimodality Analysis in Visual Object Tracking

URL: http://arxiv.org/abs/2508.13000v1
Date: Mon, 18 Aug 2025 15:18:59 GMT
Title: Omni Survey for Multimodality Analysis in Visual Object Tracking
Authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Hui Li, Shaochuan Zhao, Tao Zhou, Chunyang Cheng, Xiaojun Wu, Josef Kittler
Abstract summary: This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT). MMVOT differs from single-modal tracking in four key aspects: data collection, modality alignment and annotation, model design, and evaluation.
Score: 34.25429207685124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development of smart cities has led to the generation of massive amounts of multi-modal data in the context of a range of tasks that enable a comprehensive monitoring of the smart city infrastructure and services. This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT), from the perspective of multimodality analysis. Generally, MMVOT differs from single-modal tracking in four key aspects, data collection, modality alignment and annotation, model designing, and evaluation. Accordingly, we begin with an introduction to the relevant data modalities, laying the groundwork for their integration. This naturally leads to a discussion of challenges of multi-modal data collection, alignment, and annotation. Subsequently, existing MMVOT methods are categorised, based on different ways to deal with visible (RGB) and X modalities: programming the auxiliary X branch with replicated or non-replicated experimental configurations from the RGB branch. Here X can be thermal infrared (T), depth (D), event (E), near infrared (NIR), language (L), or sonar (S). The final part of the paper addresses evaluation and benchmarking. In summary, we undertake an omni survey of all aspects of multi-modal visual object tracking (VOT), covering six MMVOT tasks and featuring 338 references in total. In addition, we discuss the fundamental rhetorical question: Is multi-modal tracking always guaranteed to provide a superior solution to unimodal tracking with the help of information fusion, and if not, in what circumstances its application is beneficial. Furthermore, for the first time in this field, we analyse the distributions of the object categories in the existing MMVOT datasets, revealing their pronounced long-tail nature and a noticeable lack of animal categories when compared with RGB datasets.

Related papers

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis [12.87083600993665]
This dataset consists of 1092 multi-modal image sets, covering 38,838 ships. Each image set is acquired within one week and registered to ensure temporal consistency. We define benchmarks on five fundamental tasks and compare methods across the dataset.
arXiv Detail & Related papers (2025-08-04T13:09:58Z)
FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment [7.5152380894919055]
We propose an end-to-end framework that reasonably integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
arXiv Detail & Related papers (2025-05-24T14:51:19Z)
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments [49.45034796115852]
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment. Current datasets fall short in scale and realism and do not capture the nature of OR scenes, limiting multimodal OR modeling. We introduce MM-OR, a realistic and large-scale multimodal OR dataset, and the first dataset to enable multimodal scene graph generation.
arXiv Detail & Related papers (2025-03-04T13:00:52Z)
Composed Multi-modal Retrieval: A Survey of Approaches and Applications [81.54640206021757]
Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology. CMR enables users to query images or videos by integrating a reference visual input with textual modifications. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications.
arXiv Detail & Related papers (2025-03-03T09:18:43Z)
Visual Object Tracking across Diverse Data Modalities: A Review [33.006051781123595]
Visual Object Tracking (VOT) is an attractive and significant research area in computer vision. We first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language.
arXiv Detail & Related papers (2024-12-13T09:25:18Z)
Awesome Multi-modal Object Tracking [41.76977058932557]
Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities to estimate the state of an arbitrary object in a video sequence. To track the latest progress in MMOT, we conduct a comprehensive investigation in this report.
arXiv Detail & Related papers (2024-05-23T05:58:10Z)
MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely multi-view MRD network (MMRDN). We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions. We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline [80.13652104204691]
In this paper, we construct a large-scale benchmark with high diversity for visible-thermal UAV tracking (VTUAV). We provide a coarse-to-fine attribute annotation, where frame-level attributes are provided to exploit the potential of challenge-specific trackers. In addition, we design a new RGB-T baseline, named Hierarchical Multi-modal Fusion Tracker (HMFT), which fuses RGB-T data in various levels.
arXiv Detail & Related papers (2022-04-08T15:22:33Z)
The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)
Multi-modal Visual Tracking: Review and Experimental Comparison [85.20414397784937]
We summarize the multi-modal tracking algorithms, especially visible-depth (RGB-D) tracking and visible-thermal (RGB-T) tracking. We conduct experiments to analyze the effectiveness of trackers on five datasets.
arXiv Detail & Related papers (2020-12-08T02:39:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.