Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios
- URL: http://arxiv.org/abs/2505.14517v1
- Date: Tue, 20 May 2025 15:43:55 GMT
- Title: Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios
- Authors: Jakob Kienegger, Timo Gerkmann
- Abstract summary: Spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities. We propose a weakly guided extraction method depending solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities.
- Score: 15.736484513462973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g., when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method depending solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched, but strongly guided extraction method.
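The weakly guided scheme described in the abstract can be sketched as a per-frame loop: only the initial direction is given, a tracker updates the steering angle each frame, and a spatially selective filter is steered accordingly. The function names below (`estimate_doa`, `steer_filter`) are illustrative placeholders standing in for the paper's deep tracker and non-linear spatial filter, not its actual API:

```python
def weakly_guided_extract(frames, initial_doa, estimate_doa, steer_filter):
    """Extract the target from a sequence of multichannel frames.

    frames:       iterable of per-frame multichannel observations
    initial_doa:  target direction (degrees) at t = 0 -- the only guidance
    estimate_doa: tracker returning an updated direction from the current
                  frame and the previous estimate
    steer_filter: extraction filter applied to a frame for a given direction
    """
    doa = initial_doa
    output = []
    for frame in frames:
        doa = estimate_doa(frame, doa)           # track the moving target
        output.append(steer_filter(frame, doa))  # steer the filter there
    return output
```

The key point of the design is that guidance enters only through `initial_doa`; every later steering direction is produced autonomously by the tracker.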
Related papers
- Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance [14.16697537117357]
We present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. We show how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance.
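A particle filter for direction-of-arrival tracking, of the low-complexity kind this related paper refers to, can be illustrated minimally as follows. The random-walk motion model, Gaussian observation likelihood, and all parameter values are assumptions for the sketch, not the paper's actual design:

```python
import numpy as np

def particle_filter_step(particles, weights, observation, rng,
                         motion_std=2.0, obs_std=5.0):
    """One predict/update/resample step over DOA particles (degrees)."""
    # Predict: random-walk motion model for a moving speaker.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: hypothetical Gaussian likelihood of the observed direction.
    weights = weights * np.exp(-0.5 * ((particles - observation) / obs_std) ** 2)
    weights = weights / weights.sum()
    # Resample when the effective sample size degenerates.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    # Point estimate: weighted mean of the particle directions.
    return particles, weights, float(np.sum(weights * particles))
```

In a real system the likelihood would be derived from spatial features of the microphone signals rather than from a scalar DOA observation.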
arXiv Detail & Related papers (2025-07-03T16:54:56Z) - Seurat: From Moving Points to Depth [66.65189052568209]
We propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
arXiv Detail & Related papers (2025-04-20T17:37:02Z) - Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z) - Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations [53.797896854533384]
Class-agnostic motion prediction methods directly predict the motion of the entire point cloud.
While most existing methods rely on fully-supervised learning, the manual labeling of point cloud data is laborious and time-consuming.
We introduce three simple spatial and temporal regularization losses, which facilitate the self-supervised training process effectively.
arXiv Detail & Related papers (2024-03-20T02:58:45Z) - Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios [11.811571392419324]
Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios.
This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
arXiv Detail & Related papers (2023-12-17T16:12:35Z) - Learning Representative Trajectories of Dynamical Systems via Domain-Adaptive Imitation [0.0]
We propose DATI, a deep reinforcement learning agent designed for domain-adaptive trajectory imitation.
Our experiments show that DATI outperforms baseline methods for imitation learning and optimal control in this setting.
Its generalization to a real-world scenario is shown through the discovery of abnormal motion patterns in maritime traffic.
arXiv Detail & Related papers (2023-04-19T15:53:48Z) - Spatially Selective Deep Non-linear Filters for Speaker Extraction [21.422488450492434]
We develop a deep joint spatial-spectral non-linear filter that can be steered in an arbitrary target direction.
We show that this scheme is more effective than the baseline approach and increases the flexibility of the filter at no performance cost.
arXiv Detail & Related papers (2022-11-04T12:54:06Z) - Pre-training General Trajectory Embeddings with Maximum Multi-view Entropy Coding [36.18788551389281]
Trajectory embeddings can improve task performance but may incur high computational costs and face limited training data availability.
Existing trajectory embedding methods face difficulties in learning general embeddings due to biases towards certain downstream tasks.
We propose Maximum Multi-view Trajectory Entropy Coding (MMTEC) for learning general, comprehensive trajectory embeddings.
arXiv Detail & Related papers (2022-07-29T08:16:20Z) - Deep Shells: Unsupervised Shape Correspondence with Optimal Transport [52.646396621449]
We propose a novel unsupervised learning approach to 3D shape correspondence.
We show that the proposed method significantly improves over the state-of-the-art on multiple datasets.
arXiv Detail & Related papers (2020-10-28T22:24:07Z) - AutoTrajectory: Label-free Trajectory Extraction and Prediction from Videos using Dynamic Points [92.91569287889203]
We present a novel, label-free algorithm, AutoTrajectory, for trajectory extraction and prediction.
To better capture the moving objects in videos, we introduce dynamic points.
We aggregate dynamic points to instance points, which stand for moving objects such as pedestrians in videos.
arXiv Detail & Related papers (2020-07-11T08:43:34Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
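As background for the direction-informed filtering discussed in the last entry, a classic far-field delay-and-sum steering vector for a uniform linear array illustrates the kind of directional cue such a filter can consume. The array geometry and parameter values are illustrative, not taken from any of the papers above:

```python
import numpy as np

def steering_vector(theta_deg, n_mics=4, mic_spacing=0.08,
                    freq_hz=1000.0, c=343.0):
    """Far-field steering vector for a uniform linear array.

    theta_deg:   direction of arrival (0 degrees = broadside)
    mic_spacing: inter-microphone distance in meters
    c:           speed of sound in m/s
    """
    theta = np.deg2rad(theta_deg)
    mic_positions = np.arange(n_mics) * mic_spacing   # meters along the array
    delays = mic_positions * np.sin(theta) / c        # per-mic delays, seconds
    return np.exp(-2j * np.pi * freq_hz * delays)     # unit-magnitude phases
```

At broadside (0 degrees) all delays vanish and the vector is all ones; neural direction-informed filters typically use such phase patterns (or inter-channel phase differences derived from them) as conditioning features rather than applying the beamformer directly.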
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.