Related papers: Multi-channel multi-speaker transformer for speech recognition

Multi-channel multi-speaker transformer for speech recognition

URL: http://arxiv.org/abs/2601.02688v1
Date: Tue, 06 Jan 2026 03:46:07 GMT
Title: Multi-channel multi-speaker transformer for speech recognition
Authors: Guo Yifan, Tian Yao, Suo Hongbin, Wan Yulong,
Abstract summary: We propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR.<n>M2Former outperforms neural beamformer, MCT, dual-path RNN with transform-average-concatenate and multi-channel deep clustering based end-to-end systems.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, a multi-channel transformer (MCT) has been proposed, which demonstrates the ability of the transformer to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. Based on these, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. Experiments on the SMS-WSJ benchmark show that the M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2% respectively, in terms of relative word error rate reduction.

Related papers

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal. Our experiments show that the use of a dual path transformer in the separator backbone along with proposed training paradigm improves the CNN baseline by $3.12$ dB points.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder.
arXiv Detail & Related papers (2023-10-16T06:40:18Z)
The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several approaches to empower the clustering-based speaker diarization system. For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input. We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
Multi-Channel Transformer Transducer for Speech Recognition [15.268402294151468]
We present a novel speech recognition model, Multi-Channel Transformer Transducer (MCTT) MCTT features end-to-end multi-channel training, low cost, and low latency so that it is suitable for streaming decoding in on-device speech recognition.
arXiv Detail & Related papers (2021-08-30T01:50:51Z)
MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text to speech (TTS) model (e.g., Transformer TTSciteli 2019neural, FastSpeechciteren 2019fastspeech) has shown the advantages of training and inference efficiency over RNN-based model. We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
Identify Speakers in Cocktail Parties with End-to-End Attention [48.96655134462949]
This paper presents an end-to-end system that integrates speech source extraction and speaker identification. We propose a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy.
arXiv Detail & Related papers (2020-05-22T22:15:16Z)
Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework. It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.