Related papers: EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving

EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving

URL: http://arxiv.org/abs/2402.18302v2
Date: Mon, 5 Aug 2024 23:48:12 GMT
Title: EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving
Authors: Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, Kailun Yang,
Abstract summary: Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. We establish the first set of large-scale AR-MOT benchmarks.
Score: 64.58258341591929
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces the task of Auditory Referring Multi-Object Tracking (AR-MOT), which dynamically tracks specific objects in a video sequence based on audio expressions and appears as a challenging problem in autonomous driving. Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking, which often comes at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, limiting the application of such methods in autonomous driving. In this paper, we delve into the problem of AR-MOT from the perspective of audio-video fusion and audio-video tracking. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. The dual streams are intertwined with our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and video features from both frequency- and spatiotemporal domains. Moreover, we propose the Audio-visual Contrastive Tracking Learning (ACTL) regime to extract homogeneous semantic features between expressions and visual objects by learning homogeneous features between different audio and video objects effectively. Aside from the architectural design, we establish the first set of large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD. Extensive experiments on the established benchmarks demonstrate the effectiveness of the proposed EchoTrack and its components. The source code and datasets are available at https://github.com/lab206/EchoTrack.

Related papers

Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video.<n>Existing methods typically model motion cues and appearance separately, overlooking their interplay and resulting in suboptimal tracking performance.<n>We propose AMOT, which exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z)
CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking [68.24998698508344]
We introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation.<n>Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models.<n>Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks.
arXiv Detail & Related papers (2025-05-02T13:26:23Z)
HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision [34.7347336548199]
In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm. We present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT) Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking [8.238662377845142]
We present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
arXiv Detail & Related papers (2024-10-08T12:15:17Z)
TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
arXiv Detail & Related papers (2024-04-08T14:30:42Z)
Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation. This framework establishes robust correlations between an object's visual characteristics and its associated sound. We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
InterTrack: Interaction Transformer for 3D Multi-Object Tracking [9.283656931246645]
3D multi-object tracking (MOT) is a key problem for autonomous vehicles. Our proposed solution, InterTrack, generates discriminative object representations for data association. We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements.
arXiv Detail & Related papers (2022-08-17T03:24:36Z)
Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm. A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT)
arXiv Detail & Related papers (2022-03-29T01:38:49Z)
Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy [63.91005999481061]
A practical long-term tracker typically contains three key properties, i.e. an efficient model design, an effective global re-detection strategy and a robust distractor awareness mechanism. We propose a two-task tracking frame work (named DMTrack) to achieve distractor-aware fast tracking via Dynamic convolutions (d-convs) and Multiple object tracking (MOT) philosophy. Our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT and VOT 2019LT benchmarks and runs in real-time (3x faster
arXiv Detail & Related papers (2021-04-25T00:59:53Z)
Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence. Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type. We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.