Related papers: LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

URL: http://arxiv.org/abs/2510.10687v1
Date: Sun, 12 Oct 2025 16:31:05 GMT
Title: LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation
Authors: Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu
Abstract summary: In-car multi-zone speech separation plays a crucial role in human-vehicle interaction. Previous SpatialNet has achieved notable results, but its high computational cost still hinders real-time applications in vehicles. This paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation.
Score: 48.822698652567944
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
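The abstract names Interaural Phase Difference (IPD) as one of the spatial features. As a rough illustration only, and not the paper's SpaIEC module, the sketch below computes the IPD of one frequency bin for a two-microphone frame using the standard definition (the phase of one channel's spectrum times the conjugate of the other's). The helper names, frame length, bin index, and delay are all hypothetical.

```python
import cmath
import math


def dft_bin(frame, k):
    """Return the k-th DFT coefficient of a real-valued frame."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))


def ipd(frame_a, frame_b, k):
    """Phase difference in radians, wrapped to [-pi, pi], between two channels at bin k."""
    return cmath.phase(dft_bin(frame_a, k) * dft_bin(frame_b, k).conjugate())


# Hypothetical two-microphone frame: the same tone reaches mic B two samples later.
N, K, DELAY = 64, 4, 2
mic_a = [math.cos(2 * math.pi * K * i / N) for i in range(N)]
mic_b = [math.cos(2 * math.pi * K * (i - DELAY) / N) for i in range(N)]

measured = ipd(mic_a, mic_b, K)
expected = 2 * math.pi * K * DELAY / N  # phase shift implied by the delay

print(f"IPD at bin {K}: {measured:.4f} rad")
```

Because the delay between microphones depends on where the talker sits, a per-bin phase difference like this carries the zone information that a separation model can exploit.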

Related papers

SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays [45.93777164579776]
We propose a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model. We conduct comprehensive experiments on AliMeeting, a real meeting corpus, to determine the optimal model design for SpatialEmb. Our best model trained with 105 hours achieves 17.04% and 20.32% character error rates (CER) on the Eval and Test sets.
arXiv Detail & Related papers (2026-01-25T23:21:49Z)
SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models [62.14165748145729]
We introduce SPUR, a lightweight, plug-in approach that equips large audio-language models (LALMs) with spatial perception. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning.
arXiv Detail & Related papers (2025-11-10T01:29:26Z)
Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation [21.959448032308615]
Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel. We introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving.
arXiv Detail & Related papers (2025-09-16T16:37:59Z)
FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
CDXLSTM: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory [7.926250735066206]
In this paper, we propose CDXLSTM, with a core component that is a powerful XLSTM-based feature enhancement layer. Specifically, we introduce a scale-specific Feature Enhancer layer, incorporating a Cross-Temporal Global Perceptron customized for semantic-accurate deep features. We also propose a Cross-Scale Interactive Fusion module to progressively interact global change representations with spatial responses.
arXiv Detail & Related papers (2024-11-12T15:22:14Z)
TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs. TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We show that TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% on EchoSet and real-world data.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
SPMamba: State-space model is all you need in speech separation [20.168153319805665]
CNN-based speech separation models face local receptive field limitations and cannot effectively capture long time dependencies. We introduce an innovative speech separation method called SPMamba. This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules.
arXiv Detail & Related papers (2024-04-02T16:04:31Z)
Spatial-Spectral Residual Network for Hyperspectral Image Super-Resolution [82.1739023587565]
We propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet). Our method can effectively explore spatial-spectral information by using 3D convolution instead of 2D convolution, which enables the network to better extract potential information. In each unit, we employ spatial and temporal separable 3D convolution to extract spatial and spectral information, which not only reduces unaffordable memory usage and high computational cost, but also makes the network easier to train.
arXiv Detail & Related papers (2020-01-14T03:34:55Z)
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals. Two main challenges are the complex acoustic environment and the real-time processing requirement. We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.