Cross-Modal Object Tracking: Modality-Aware Representations and A
Unified Benchmark
- URL: http://arxiv.org/abs/2111.04264v2
- Date: Thu, 11 Nov 2021 08:30:58 GMT
- Title: Cross-Modal Object Tracking: Modality-Aware Representations and A
Unified Benchmark
- Authors: Chenglong Li, Tianhao Zhu, Lei Liu, Xiaonan Si, Zilin Fan, Sulan Zhai
- Abstract summary: In many visual systems, visual tracking is often based on RGB image sequences, in which some targets become invisible in low-light conditions.
We propose a new algorithm that learns modality-aware target representations to mitigate the appearance gap between RGB and NIR modalities during tracking.
We will release the dataset for free academic use; the download link and code will be released soon.
- Score: 8.932487291107812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many visual systems, visual tracking is often based on RGB image sequences,
in which some targets become invisible in low-light conditions, and tracking
performance is thus significantly affected. Introducing other modalities such
as depth and infrared data is an effective way to handle imaging limitations of
individual sources, but multi-modal imaging platforms usually require elaborate
designs and cannot be applied in many real-world applications at present.
Near-infrared (NIR) imaging has become an essential part of many surveillance
cameras, which switch between RGB and NIR imaging depending on the light
intensity. These two modalities are heterogeneous, with very different visual
properties, and thus pose significant challenges for visual tracking. However, existing
works have not studied this challenging problem. In this work, we address the
cross-modal object tracking problem and contribute a new video dataset
comprising 654 cross-modal image sequences with over 481K frames in total and
an average video length of more than 735 frames. To promote the research and
development of cross-modal object tracking, we propose a new algorithm that
learns modality-aware target representations to mitigate the appearance gap
between RGB and NIR modalities during tracking. It is plug-and-play and can
thus be flexibly embedded into different tracking frameworks. Extensive
experiments are conducted on the dataset, and we demonstrate the effectiveness
of the proposed algorithm in two representative tracking frameworks against 17
state-of-the-art tracking methods. We will release the dataset for free
academic use; the download link and code will be released soon.
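The abstract describes the algorithm only at a high level: a plug-and-play module that learns modality-aware target representations and can be embedded into existing trackers. The sketch below is a minimal, hypothetical PyTorch-style illustration of what such a plug-in could look like; the module name, the per-modality adapter design, and the integration points are assumptions for illustration, not the authors' actual method.

```python
# Hypothetical sketch only: the paper's architecture is not described in the
# abstract, so the branch/adapter design below is an assumption for illustration.
import torch
import torch.nn as nn

class ModalityAwareHead(nn.Module):
    """Plug-in head that adapts a shared backbone feature to RGB or NIR frames."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # One lightweight adapter per modality, plus a shared path.
        self.rgb_adapter = nn.Conv2d(channels, channels, kernel_size=1)
        self.nir_adapter = nn.Conv2d(channels, channels, kernel_size=1)
        self.shared = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, is_nir: bool) -> torch.Tensor:
        # Modality-specific residual correction on top of a shared representation.
        adapter = self.nir_adapter if is_nir else self.rgb_adapter
        return self.shared(feat) + adapter(feat)

# Possible integration into an existing tracker (pseudo-usage):
#   feat = backbone(frame)                    # any RGB tracker's backbone
#   feat = modality_head(feat, is_nir=flag)   # flag from the camera's RGB/NIR switch
#   response = tracker_head(feat)
```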
Related papers
- CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event
Cameras [43.699819213559515]
Existing datasets for RGB-DVS tracking are collected with a DVS346 camera, and their resolution ($346 \times 260$) is too low for practical applications.
We build the first unaligned frame-event dataset CRSOT collected with a specially built data acquisition system.
We propose a novel unaligned object tracking framework that can realize robust tracking even using the loosely aligned RGB-Event data.
arXiv Detail & Related papers (2024-01-05T14:20:22Z) - Cross-Modal Object Tracking via Modality-Aware Fusion Network and A
Large-Scale Dataset [20.729414075628814]
We propose an adaptive cross-modal object tracking algorithm called Modality-Aware Fusion Network (MAFNet).
MAFNet efficiently integrates information from both RGB and NIR modalities using an adaptive weighting mechanism (a generic sketch of this idea is given after the related papers list).
arXiv Detail & Related papers (2023-12-22T05:22:33Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques (a generic sketch of this idea is also given after the related papers list).
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z) - mEBAL2 Database and Benchmark: Image-based Multispectral Eyeblink Detection [14.052943954940758]
This work introduces a new multispectral database and novel approaches for eyeblink detection in RGB and Near-Infrared (NIR) individual images.
mEBAL2 is the largest existing eyeblink database.
mEBAL2 includes 21,100 image sequences from 180 different students.
arXiv Detail & Related papers (2023-09-14T17:25:25Z) - Diverse Embedding Expansion Network and Low-Light Cross-Modality
Benchmark for Visible-Infrared Person Re-identification [26.71900654115498]
We propose a novel augmentation network in the embedding space, called the diverse embedding expansion network (DEEN).
The proposed DEEN can effectively generate diverse embeddings to learn the informative feature representations.
We provide a low-light cross-modality (LLCM) dataset, which contains 46,767 bounding boxes of 1,064 identities captured by 9 RGB/IR cameras.
arXiv Detail & Related papers (2023-03-25T14:24:56Z) - Learning Dual-Fused Modality-Aware Representations for RGBD Tracking [67.14537242378988]
Compared with traditional RGB object tracking, adding the depth modality can effectively alleviate interference between the target and the background.
Some existing RGBD trackers use the two modalities separately, so particularly useful information shared between them is ignored.
We propose a novel Dual-fused Modality-aware Tracker (termed DMTracker) which aims to learn informative and discriminative representations of the target objects for robust RGBD tracking.
arXiv Detail & Related papers (2022-11-06T07:59:07Z) - Learning Modal-Invariant and Temporal-Memory for Video-based
Visible-Infrared Person Re-Identification [46.49866514866999]
We primarily study video-based cross-modal person Re-ID methods.
We show that performance improves as the number of frames in a tracklet increases.
A novel method is proposed, which projects two modalities to a modal-invariant subspace.
arXiv Detail & Related papers (2022-08-04T04:43:52Z) - Scalable and Real-time Multi-Camera Vehicle Detection,
Re-Identification, and Tracking [58.95210121654722]
We propose a real-time city-scale multi-camera vehicle tracking system that handles real-world, low-resolution CCTV instead of idealized and curated video streams.
Our method is ranked among the top five performers on the public leaderboard.
arXiv Detail & Related papers (2022-04-15T12:47:01Z) - Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline [80.13652104204691]
In this paper, we construct a large-scale benchmark with high diversity for visible-thermal UAV tracking (VTUAV).
We provide a coarse-to-fine attribute annotation, where frame-level attributes are provided to exploit the potential of challenge-specific trackers.
In addition, we design a new RGB-T baseline, named Hierarchical Multi-modal Fusion Tracker (HMFT), which fuses RGB-T data in various levels.
arXiv Detail & Related papers (2022-04-08T15:22:33Z) - Drone-based RGB-Infrared Cross-Modality Vehicle Detection via
Uncertainty-Aware Learning [59.19469551774703]
Drone-based vehicle detection aims at finding the vehicle locations and categories in an aerial image.
We construct a large-scale drone-based RGB-Infrared vehicle detection dataset, termed DroneVehicle.
Our DroneVehicle collects 28,439 RGB-Infrared image pairs, covering urban roads, residential areas, parking lots, and other scenarios from day to night.
arXiv Detail & Related papers (2020-03-05T05:29:44Z)
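As referenced in the MAFNet entry above, adaptive-weighting fusion of RGB and NIR features can be illustrated with a small gating module. The sketch below is a generic, hypothetical implementation of that idea; MAFNet's actual architecture is not described in the summary, so all names and design choices here are assumptions.

```python
# Hypothetical sketch of adaptive-weighting fusion between RGB and NIR
# features; this is an illustration of the general idea, not MAFNet itself.
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # Predict a per-sample weight from globally pooled features of both modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, nir_feat: torch.Tensor) -> torch.Tensor:
        # Global average pooling -> scalar weight w in (0, 1) per sample.
        pooled = torch.cat([rgb_feat.mean(dim=(2, 3)), nir_feat.mean(dim=(2, 3))], dim=1)
        w = self.gate(pooled).view(-1, 1, 1, 1)
        # Convex combination of the two modality features.
        return w * rgb_feat + (1.0 - w) * nir_feat
```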
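Similarly, the Un-Track entry mentions learning a common latent space through low-rank factorization and reconstruction. The following minimal sketch illustrates a low-rank bottleneck with a reconstruction loss; it is not Un-Track's implementation, and all names and shapes are assumptions.

```python
# Hypothetical sketch of a shared low-rank latent space with a reconstruction
# objective, in the spirit of the Un-Track summary above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSharedSpace(nn.Module):
    def __init__(self, channels: int = 256, rank: int = 16):
        super().__init__()
        # Factorized projection: channels -> rank -> channels.
        self.down = nn.Linear(channels, rank, bias=False)
        self.up = nn.Linear(rank, channels, bias=False)

    def forward(self, feat: torch.Tensor):
        # feat: (batch, channels) pooled descriptor from any modality.
        latent = self.down(feat)            # common low-rank latent code
        recon = self.up(latent)             # reconstruction back to feature space
        recon_loss = F.mse_loss(recon, feat)
        return latent, recon_loss
```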
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.