BeamTransformer: Microphone Array-based Overlapping Speech Detection
- URL: http://arxiv.org/abs/2109.04049v1
- Date: Thu, 9 Sep 2021 06:10:48 GMT
- Authors: Siqi Zheng, Shiliang Zhang, Weilong Huang, Qian Chen, Hongbin Suo,
Ming Lei, Jinwei Feng, Zhijie Yan
- Abstract summary: BeamTransformer seeks to optimize modeling of the sequential relationships among signals from different spatial directions.
BeamTransformer excels at learning to identify the relationships among different beam sequences.
BeamTransformer takes one step further, as speech from overlapping speakers has been internally separated into different beams.
- Score: 52.11665331754917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose BeamTransformer, an efficient architecture that leverages
a beamformer's edge in spatial filtering and a transformer's capability in
context sequence modeling. BeamTransformer seeks to optimize modeling of the
sequential relationships among signals from different spatial directions.
Overlapping speech detection is one of the tasks where such optimization is
favorable. In this paper we effectively apply BeamTransformer to detect
overlapping segments. Compared to single-channel approaches, BeamTransformer
excels at learning to identify the relationships among different beam sequences
and is hence able to make predictions not only from the acoustic signals but
also from the localization of the source. The results indicate that a
successful incorporation of microphone array signals can lead to remarkable
gains. Moreover, BeamTransformer takes one step further, as speech from
overlapping speakers has been internally separated into different beams.
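The pipeline the abstract describes, spatial filtering into fixed beams followed by sequence modeling across beams, can be illustrated roughly as follows. This is a minimal sketch: a plain delay-and-sum beamformer and a toy parameter-free single-head attention stand in for the paper's actual front end and transformer, and all function names, shapes, and delays are illustrative, not taken from the paper.

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Fixed delay-and-sum beamformer: advance each microphone channel
    by its integer sample delay and average. mics: (n_mics, n_samples)."""
    out = np.zeros(mics.shape[1])
    for ch, d in zip(mics, delays):
        out += np.roll(ch, -d)  # realign this channel, then sum
    return out / mics.shape[0]

def self_attention(x):
    """Toy parameter-free single-head scaled dot-product self-attention
    over a feature sequence x: (seq_len, d)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ x

# Steer the same 4-mic capture toward several candidate directions,
# reduce each beam to frame-level log-energy features, and let
# attention relate the per-beam feature sequences to each other.
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
mics = np.stack([np.roll(source, d) for d in (0, 2, 4, 6)])
beams = [delay_and_sum(mics, [0, d, 2 * d, 3 * d]) for d in (0, 1, 2, 3)]
feats = np.stack([np.log(b.reshape(16, 64) ** 2 @ np.ones(64) + 1e-8)
                  for b in beams])      # (n_beams, n_frames)
fused = self_attention(feats)           # relate beam sequences
```

In this toy setup the beam whose delays match the source's true inter-mic delays (d = 2) reconstructs the source coherently, while mismatched beams blur it, which is the spatial cue the cross-beam attention can exploit.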
Related papers
- A unified multichannel far-field speech recognition system: combining
neural beamforming with attention based end-to-end model [14.795953417531907]
We propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system.
The proposed method achieves a 19.26% improvement over a strong baseline.
arXiv Detail & Related papers (2024-01-05T07:11:13Z) - Sionna RT: Differentiable Ray Tracing for Radio Propagation Modeling [65.17711407805756]
Sionna is a GPU-accelerated open-source library for link-level simulations.
Since release v0.14 it integrates a differentiable ray tracer (RT) for the simulation of radio wave propagation.
arXiv Detail & Related papers (2023-03-20T13:40:11Z) - MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware
Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal.
Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source.
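How predicted per-source beamforming weights would be applied to the multichannel mixture can be sketched as a filter-and-sum in the STFT domain. The weight prediction itself is the network's job; this sketch only shows the final application step, and the tensor shapes and function name are assumptions, not taken from the paper.

```python
import numpy as np

def filter_and_sum(stft_mix, weights):
    """Apply per-source complex beamforming weights to a mixture.
    stft_mix: (n_ch, n_frames, n_freq) multichannel mixture STFT.
    weights:  (n_src, n_ch, n_freq) per-source, per-frequency weights.
    Returns (n_src, n_frames, n_freq) separated source estimates:
    Y[s, t, f] = sum_c conj(W[s, c, f]) * X[c, t, f]."""
    return np.einsum('scf,ctf->stf', np.conj(weights), stft_mix)
```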
arXiv Detail & Related papers (2022-12-07T01:52:40Z) - SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
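The two-stage attention SepTr describes can be sketched with a parameter-free stand-in for the transformer blocks (numpy only; a real implementation would add learned projections, multiple heads, feed-forward layers, and residual connections, and the shapes here are assumed for illustration):

```python
import numpy as np

def softmax_attention(x):
    """Parameter-free scaled dot-product self-attention over the
    second-to-last axis of x: (..., seq_len, d)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def septr_block(spec):
    """spec: (T, F, d) spectrogram tokens.
    Stage 1: attend across time among tokens in the same frequency bin.
    Stage 2: attend across frequency among tokens in the same frame."""
    x = softmax_attention(np.swapaxes(spec, 0, 1))  # (F, T, d): per-bin
    x = softmax_attention(np.swapaxes(x, 0, 1))     # (T, F, d): per-frame
    return x
```

The design choice this illustrates: separable attention over the two axes costs O(F·T² + T·F²) rather than the O((T·F)²) of full attention over all time-frequency tokens.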
arXiv Detail & Related papers (2022-03-17T19:48:43Z) - A Deep-Bayesian Framework for Adaptive Speech Duration Modification [20.99099283004413]
We use a Bayesian framework to define a latent attention map that links frames of the input and target utterances.
We train a masked convolutional encoder-decoder network to produce this attention map via a version of the mean absolute error loss function.
We show that our technique results in a high quality of generated speech that is on par with state-of-the-art vocoders.
arXiv Detail & Related papers (2021-07-11T05:53:07Z) - BERT for Joint Multichannel Speech Dereverberation with Spatial-aware
Tasks [6.876734825043823]
We propose a method for joint multichannel speech dereverberation with two spatial-aware tasks.
The proposed method addresses the involved tasks as a sequence-to-sequence mapping problem.
arXiv Detail & Related papers (2020-10-21T11:05:17Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Transformer with Bidirectional Decoder for Speech Recognition [32.56014992915183]
We introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously.
Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target.
In the inference stage, we use the introduced bidirectional beam search method, which generates both left-to-right and right-to-left candidates.
arXiv Detail & Related papers (2020-08-11T02:12:42Z) - Relative Positional Encoding for Speech Recognition and Direct
Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
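The core idea, biasing attention scores by the relative distance i − j instead of encoding absolute positions, can be sketched as follows. This is an assumed minimal form of relative position biasing, not the paper's exact scheme; the bias values are random here, whereas a trained model would learn them.

```python
import numpy as np

def rel_pos_attention(x, max_rel=8, seed=0):
    """Self-attention whose scores receive a relative-position bias
    b[i - j], with distances clipped to [-max_rel, max_rel]. x: (n, d)."""
    n, d = x.shape
    bias_table = 0.1 * np.random.default_rng(seed).standard_normal(2 * max_rel + 1)
    rel = np.arange(n)[:, None] - np.arange(n)[None, :]  # i - j
    idx = np.clip(rel, -max_rel, max_rel) + max_rel      # index into table
    scores = x @ x.T / np.sqrt(d) + bias_table[idx]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ x
```

Because the bias depends only on i − j, the same table applies at every position, which is why such models generalize better to the variable sequence lengths found in speech than absolute position encodings do.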
arXiv Detail & Related papers (2020-05-20T09:53:06Z) - End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.