Related papers: Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments

Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments

URL: http://arxiv.org/abs/2511.19396v1
Date: Mon, 24 Nov 2025 18:33:50 GMT
Title: Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments
Authors: Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos,
Abstract summary: This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization.<n>The system is well-suited for teleconferencing, smart home devices, and assistive technologies.
Score: 3.0718743078604067
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.

Related papers

Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera [54.967647497048205]
We present Stereo-Inertial Poser, a real-time motion capture system that estimates metric-accurate and shape-aware 3D human motion.<n>We replace the monocular RGB with stereo vision, enabling direct 3D keypoint extraction and body shape parameter estimation.<n>Our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
arXiv Detail & Related papers (2026-03-02T17:46:38Z)
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound [5.591620304505415]
This work presents the first formal framework for Audio-Visual World Models (AVWM)<n>It formulates multimodal environment simulation as a partially observable decision process with audio-visual observations, fine-grained actions, and task rewards.<n>We propose an Audio-Visual Conditional Transformer with a novel modality expert architecture that balances visual and auditory learning.
arXiv Detail & Related papers (2025-11-30T13:11:56Z)
Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation [41.85539404067887]
In audio-visual navigation (AVN) tasks, an embodied agent must autonomously localize a sound source in complex 3D environments.<n>Existing methods often rely on static modality fusion strategies and neglect the spatial cues embedded in stereo audio.<n>We propose an end-to-end reinforcement learning-based AVN framework with two key innovations.
arXiv Detail & Related papers (2025-09-21T05:11:09Z)
Wandering around: A bioinspired approach to visual attention through object motion sensitivity [40.966228784674115]
Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision.<n>Event-based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes.<n>To distinguish moving objects while the event-based camera is in motion the agent requires an object motion segmentation mechanism.<n>This work presents a Convolutional Neural Network bio-inspired attention system for selective attention through object motion sensitivity.
arXiv Detail & Related papers (2025-02-10T18:16:30Z)
SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera [61.642416712939095]
SoundLoc3D treats the task as a set prediction problem, each element in the set corresponds to a potential sound source.<n>We demonstrate the efficiency and superiority of SoundLoc3D on large-scale simulated dataset.
arXiv Detail & Related papers (2024-12-22T05:04:17Z)
Open-World Drone Active Tracking with Goal-Centered Rewards [62.21394499788672]
Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations.<n>We propose DAT, the first open-world drone active air-to-ground tracking benchmark.<n>We also propose GC-VAT, which aims to improve the performance of drone tracking targets in complex scenarios.
arXiv Detail & Related papers (2024-12-01T09:37:46Z)
I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data [4.487146086221174]
We present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations.
arXiv Detail & Related papers (2024-06-10T13:08:31Z)
EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [64.58258341591929]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. We establish the first set of large-scale AR-MOT benchmarks.
arXiv Detail & Related papers (2024-02-28T12:50:16Z)
AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
AcousticFusion: Fusing Sound Source Localization to Visual SLAM in Dynamic Environments [19.413143126734383]
We propose a novel audio-visual fusion approach that fuses sound source direction into the RGB-D image. The proposed method uses very small computational resources to obtain very stable self-localization results.
arXiv Detail & Related papers (2021-08-03T02:10:26Z)
Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source. Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
Event-based Stereo Visual Odometry [42.77238738150496]
We present a solution to the problem of visual odometry from the data acquired by a stereo event-based camera rig. We seek to maximize thetemporal consistency of stereo event-based data while using a simple and efficient representation.
arXiv Detail & Related papers (2020-07-30T15:53:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.