CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event Cameras
- URL: http://arxiv.org/abs/2401.02826v1
- Date: Fri, 5 Jan 2024 14:20:22 GMT
- Title: CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event
Cameras
- Authors: Yabin Zhu, Xiao Wang, Chenglong Li, Bo Jiang, Lin Zhu, Zhixiang Huang,
Yonghong Tian, Jin Tang
- Abstract summary: Existing datasets for RGB-DVS tracking are collected with the DVS346 camera, whose resolution ($346 \times 260$) is too low for practical applications.
We build CRSOT, the first unaligned frame-event dataset, collected with a specially built data acquisition system.
We propose a novel unaligned object tracking framework that achieves robust tracking even with loosely aligned RGB-Event data.
- Score: 43.699819213559515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing datasets for RGB-DVS tracking are collected with the DVS346 camera, and
their resolution ($346 \times 260$) is too low for practical applications.
In practice, many deployed systems use only visible cameras, and newly designed
neuromorphic cameras may have different resolutions. The latest neuromorphic
sensors can output high-definition event streams, but it is very difficult to
achieve strict alignment between events and frames in both the spatial and
temporal dimensions. Therefore, how to achieve accurate tracking with unaligned
neuromorphic and visible sensors is a valuable but largely unexplored problem. In
this work, we formally propose the task of object tracking using unaligned
neuromorphic and visible cameras. We build CRSOT, the first unaligned
frame-event dataset, collected with a specially built data acquisition system,
which contains 1,030 high-definition RGB-Event video pairs with 304,974 video
frames in total. In addition, we propose a novel unaligned object tracking
framework that achieves robust tracking even with loosely aligned RGB-Event data.
Specifically, we extract the template and search regions of the RGB and Event
data and feed them into a unified ViT backbone for feature embedding. Then,
uncertainty perception modules encode the RGB and Event features separately,
and a modality uncertainty fusion module aggregates the two modalities (a
minimal sketch of this pipeline is given after the abstract). These three
branches are jointly optimized during training. Extensive experiments
demonstrate that our tracker can exploit the two modalities collaboratively
for high-performance tracking even without strict temporal and spatial
alignment. The source code, dataset, and
pre-trained models will be released at
https://github.com/Event-AHU/Cross_Resolution_SOT.
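Below is a minimal, hypothetical PyTorch-style sketch of the pipeline described in the abstract: a shared ViT-style backbone embeds RGB and Event template/search regions, per-modality uncertainty heads predict a feature and a log-variance, and an inverse-variance weighting stands in for the modality uncertainty fusion. All module names, sizes, and the exact fusion rule are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of an unaligned RGB-Event tracker in the spirit of the
# pipeline described above. Module names, sizes, and the variance-based
# fusion rule are assumptions; see the authors' repository for the real code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Turn an image (or rendered event frame) into a sequence of patch tokens."""

    def __init__(self, in_ch=3, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                 # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)


class UncertaintyHead(nn.Module):
    """Encode one modality into a feature plus a per-token log-variance."""

    def __init__(self, dim=256):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, tokens):
        return self.mu(tokens), self.log_var(tokens)


class UnalignedTracker(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed_rgb = PatchEmbed(3, dim)
        self.embed_evt = PatchEmbed(3, dim)   # assumes events rendered to 3 channels
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # unified ViT-style backbone
        self.unc_rgb = UncertaintyHead(dim)
        self.unc_evt = UncertaintyHead(dim)
        self.box_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4))

    def forward(self, rgb_tmpl, rgb_srch, evt_tmpl, evt_srch):
        # Template and search tokens of both modalities share one backbone.
        # For simplicity both modalities are rendered at the same resolution here;
        # truly cross-resolution inputs would first need token resampling.
        rgb = self.backbone(torch.cat([self.embed_rgb(rgb_tmpl),
                                       self.embed_rgb(rgb_srch)], dim=1))
        evt = self.backbone(torch.cat([self.embed_evt(evt_tmpl),
                                       self.embed_evt(evt_srch)], dim=1))
        mu_r, lv_r = self.unc_rgb(rgb)
        mu_e, lv_e = self.unc_evt(evt)
        # Inverse-variance weighting: the less certain modality contributes less
        # (one plausible form of "modality uncertainty fusion").
        w_r, w_e = torch.exp(-lv_r), torch.exp(-lv_e)
        fused = (w_r * mu_r + w_e * mu_e) / (w_r + w_e + 1e-6)
        return self.box_head(fused.mean(dim=1))   # (B, 4) box for the search region


if __name__ == "__main__":
    net = UnalignedTracker()
    z = torch.randn(2, 3, 128, 128)   # template crops
    x = torch.randn(2, 3, 256, 256)   # search-region crops
    print(net(z, x, z, x).shape)      # torch.Size([2, 4])
```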
Related papers
- ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset, ViDSOD-100, which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net) for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z)
- TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking [30.89375068036783]
Existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models.
We propose an Event backbone (Pooler) to obtain a high-quality feature representation that is cognisant of the intrinsic characteristics of the event data.
Our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets.
arXiv Detail & Related papers (2024-05-08T12:19:08Z)
- ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data [75.73063721067608]
We propose a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners built into Apple's iPhone and iPad.
ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total.
In-depth empirical analysis has verified that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline method compares favorably against the state of the art.
arXiv Detail & Related papers (2023-03-24T09:51:13Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness changes at every pixel asynchronously.
Event streams are divided into grids along the x-y-t coordinates for both positive and negative polarities, producing a set of pillars as a 3D tensor representation (a generic voxelization sketch is given after this entry).
Long memory is encoded in the hidden state of adaptive convLSTMs, while short memory is modeled by computing the spatio-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
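The pillar-style representation mentioned in the entry above can be illustrated with a generic event-voxelization routine: raw events (x, y, t, polarity) are binned into an x-y-t grid per polarity. The grid sizes, bin counts, and function name below are assumptions for illustration, not the paper's actual configuration.

```python
# Generic sketch: bin raw events (x, y, t, p) into an x-y-t voxel grid per
# polarity, a common stand-in for pillar-like event tensors. Sizes are
# illustrative only.
import numpy as np

def events_to_voxels(x, y, t, p, height, width, t_bins=5):
    """Return a (2, t_bins, height, width) count tensor (neg/pos polarity)."""
    grid = np.zeros((2, t_bins, height, width), dtype=np.float32)
    # Normalize timestamps to [0, 1) so each event picks a temporal bin.
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    tb = np.clip((t * t_bins).astype(int), 0, t_bins - 1)
    pol = (p > 0).astype(int)                 # 1 = positive, 0 = negative
    np.add.at(grid, (pol, tb, y, x), 1.0)     # accumulate event counts
    return grid

# Tiny usage example with synthetic events at DVS346 resolution.
rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 346, n); y = rng.integers(0, 260, n)
t = np.sort(rng.random(n)); p = rng.choice([-1, 1], n)
print(events_to_voxels(x, y, t, p, 260, 346).shape)   # (2, 5, 260, 346)
```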
- Tracking Fast by Learning Slow: An Event-based Speed Adaptive Hand Tracker Leveraging Knowledge in RGB Domain [4.530678016396477]
3D hand tracking methods based on monocular RGB videos are easily affected by motion blur, whereas the event camera, a sensor with high temporal resolution, high dynamic range, sparse output, and low power consumption, is naturally suited to this task.
We developed an event-based speed adaptive hand tracker (ESAHT) to solve the hand tracking problem using event cameras.
Our solution outperformed RGB-based as well as previous event-based solutions in fast hand tracking tasks, and our code and dataset will be made publicly available.
arXiv Detail & Related papers (2023-02-28T09:18:48Z)
- Learning Dual-Fused Modality-Aware Representations for RGBD Tracking [67.14537242378988]
Compared with traditional RGB object tracking, adding the depth modality can effectively mitigate interference between the target and the background.
Some existing RGBD trackers use the two modalities separately, so particularly useful information shared between them is ignored.
We propose a novel Dual-fused Modality-aware Tracker (termed DMTracker) which aims to learn informative and discriminative representations of the target objects for robust RGBD tracking.
arXiv Detail & Related papers (2022-11-06T07:59:07Z)
- Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark [8.932487291107812]
In many visual systems, tracking is often based on RGB image sequences, in which some targets become invisible under low-light conditions.
We propose a new algorithm, which learns the modality-aware target representation to mitigate the appearance gap between RGB and NIR modalities in the tracking process.
We will release the dataset for free academic use; the download link and code will be released soon.
arXiv Detail & Related papers (2021-11-08T03:58:55Z)
- VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows [93.54888104118822]
We propose a large-scale Visible-Event benchmark (termed VisEvent) to address the lack of a realistic and large-scale dataset for this task.
Our dataset consists of 820 video pairs captured under low illumination, high speed, and background clutter scenarios.
Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods (a generic event-image sketch is given after this entry).
arXiv Detail & Related papers (2021-08-11T03:55:12Z)
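The event-image conversion mentioned in the entry above can be illustrated with a generic accumulation routine that counts events per pixel and polarity; the channel layout, normalization, and function name are assumptions, not necessarily the benchmark's exact conversion.

```python
# Generic sketch: accumulate an event stream into a 2-channel "event image"
# (negative/positive polarity counts), one common way to feed events to
# frame-based trackers. Channel layout and normalization are assumptions.
import numpy as np

def events_to_image(x, y, p, height, width):
    """Return a (2, height, width) image of per-pixel event counts."""
    img = np.zeros((2, height, width), dtype=np.float32)
    pol = (p > 0).astype(int)          # channel 0: negative, channel 1: positive
    np.add.at(img, (pol, y, x), 1.0)   # count events per pixel and polarity
    return img / max(img.max(), 1.0)   # crude normalization to [0, 1]

# e.g., events_to_image(x, y, p, 260, 346) -> array of shape (2, 260, 346)
```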
- Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation [8.854112907350624]
3D multi-object tracking plays a vital role in autonomous navigation.
Many approaches detect objects in 2D RGB sequences for tracking, which lacks reliability when localizing objects in 3D space.
We propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in the adjacent frames.
arXiv Detail & Related papers (2020-11-25T16:14:40Z)