EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous
Driving
- URL: http://arxiv.org/abs/2402.18302v1
- Date: Wed, 28 Feb 2024 12:50:16 GMT
- Title: EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous
Driving
- Authors: Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer
Stiefelhagen, Kailun Yang
- Abstract summary: Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
- Score: 67.82112360246025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the task of Auditory Referring Multi-Object Tracking
(AR-MOT), which dynamically tracks specific objects in a video sequence based
on audio expressions, a challenging problem in autonomous
driving. Due to the lack of semantic modeling capacity in audio and video,
existing works have mainly focused on text-based multi-object tracking, which
often comes at the cost of tracking quality, interaction efficiency, and even
the safety of assistance systems, limiting the application of such methods in
autonomous driving. In this paper, we delve into the problem of AR-MOT from the
perspective of audio-video fusion and audio-video tracking. We put forward
EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
The dual streams are intertwined with our Bidirectional Frequency-domain
Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and
video features from both frequency- and spatiotemporal domains. Moreover, we
propose the Audio-visual Contrastive Tracking Learning (ACTL) regime to extract
homogeneous semantic features between audio expressions and visual objects by
effectively aligning the features of corresponding audio and video objects.
Aside from the architectural design, we establish the first set of
large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD.
Extensive experiments on the established benchmarks demonstrate the
effectiveness of the proposed EchoTrack model and its components. The source
code and datasets will be made publicly available at
https://github.com/lab206/EchoTrack.
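The abstract names two core components: Bi-FCFM, which bidirectionally fuses audio and video features in both the frequency and spatiotemporal domains, and ACTL, which contrasts audio expressions against visual objects. Below is a minimal, hypothetical PyTorch sketch of how such a fusion block and contrastive objective could be structured; the class names, tensor shapes, rFFT-based spectral gating, and InfoNCE-style loss are illustrative assumptions, not the authors' released implementation (see https://github.com/lab206/EchoTrack for the official code).

```python
# Hypothetical sketch only: bidirectional frequency-domain cross-attention
# fusion and an InfoNCE-style audio-visual contrastive loss, loosely following
# the Bi-FCFM and ACTL descriptions in the abstract. All names, shapes, and
# design details here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiFrequencyCrossAttentionFusion(nn.Module):
    """Fuses video and audio token sequences in both directions, with a
    learnable per-channel gate applied in the frequency domain."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.freq_gate = nn.Parameter(torch.ones(dim))  # per-channel spectral gate
        self.norm_video = nn.LayerNorm(dim)
        self.norm_audio = nn.LayerNorm(dim)

    def _frequency_branch(self, tokens: torch.Tensor) -> torch.Tensor:
        # rFFT over the token axis, gate the spectrum per channel, transform back.
        spectrum = torch.fft.rfft(tokens, dim=1) * self.freq_gate
        return torch.fft.irfft(spectrum, n=tokens.shape[1], dim=1)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, Nv, C) spatiotemporal tokens; audio: (B, Na, C) tokens.
        video_f = self._frequency_branch(video)
        audio_f = self._frequency_branch(audio)
        # Video queries attend to audio, and audio queries attend to video.
        fused_video, _ = self.audio_to_video(video_f, audio_f, audio_f)
        fused_audio, _ = self.video_to_audio(audio_f, video_f, video_f)
        return self.norm_video(video + fused_video), self.norm_audio(audio + fused_audio)


def audio_visual_contrastive_loss(object_emb: torch.Tensor,
                                  audio_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: row i of each (B, C) tensor is a matched
    object/audio-expression pair; all other rows act as negatives."""
    obj = F.normalize(object_emb, dim=-1)
    aud = F.normalize(audio_emb, dim=-1)
    logits = obj @ aud.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(obj.size(0), device=obj.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    fusion = BiFrequencyCrossAttentionFusion(dim=256)
    video_tokens = torch.randn(2, 100, 256)  # 2 clips, 100 visual tokens each
    audio_tokens = torch.randn(2, 32, 256)   # 2 clips, 32 audio tokens each
    v_fused, a_fused = fusion(video_tokens, audio_tokens)
    loss = audio_visual_contrastive_loss(v_fused.mean(dim=1), a_fused.mean(dim=1))
    print(v_fused.shape, a_fused.shape, float(loss))
```

In a full AR-MOT pipeline, a block of this kind would sit between the audio and video encoders and feed a transformer tracking head, with the contrastive term added to the detection and tracking losses; the paper itself defines the actual architecture and training objectives.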
Related papers
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking (OVMOT) amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking [8.238662377845142]
We present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work.
Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
arXiv Detail & Related papers (2024-10-08T12:15:17Z)
- TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
We propose the Time Interval Machine (TIM), where a modality-specific time interval serves as a query to a transformer encoder.
We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
arXiv Detail & Related papers (2024-04-08T14:30:42Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
As the first unified framework for both modalities, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- InterTrack: Interaction Transformer for 3D Multi-Object Tracking [9.283656931246645]
3D multi-object tracking (MOT) is a key problem for autonomous vehicles.
Our proposed solution, InterTrack, generates discriminative object representations for data association.
We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements.
arXiv Detail & Related papers (2022-08-17T03:24:36Z)
- Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm.
A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT).
arXiv Detail & Related papers (2022-03-29T01:38:49Z)
- Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy [63.91005999481061]
A practical long-term tracker typically has three key properties: an efficient model design, an effective global re-detection strategy, and a robust distractor-awareness mechanism.
We propose a two-task tracking framework (named DMTrack) that achieves distractor-aware fast tracking via dynamic convolutions (d-convs) and a multiple-object-tracking (MOT) philosophy.
Our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT, and VOT2019LT benchmarks and runs in real time (3x faster).
arXiv Detail & Related papers (2021-04-25T00:59:53Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise in appearance and motion cues, respectively.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.