Tune-In: Training Under Negative Environments with Interference for
Attention Networks Simulating Cocktail Party Effect
- URL: http://arxiv.org/abs/2103.01461v1
- Date: Tue, 2 Mar 2021 04:03:37 GMT
- Title: Tune-In: Training Under Negative Environments with Interference for
Attention Networks Simulating Cocktail Party Effect
- Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu
- Abstract summary: We propose a novel attention network called Tune-In, short for training under negative environments with interference.
It first learns two separate spaces of speaker-knowledge and speech-stimuli on top of a shared feature space.
Information is then exchanged between the two spaces via a novel cross- and dual-attention mechanism.
- Score: 37.697375719184926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the cocktail party problem and propose a novel attention network
called Tune-In, short for training under negative environments with
interference. It first learns two separate spaces of speaker-knowledge and
speech-stimuli on top of a shared feature space, with a new block structure
designed as the building block for all spaces, and then solves different tasks
cooperatively. Between the two spaces, information is exchanged in both
directions via a novel cross- and dual-attention mechanism, mimicking the
bottom-up and top-down processes of the human cocktail party effect. It turns
out that substantially discriminative and generalizable speaker
representations can be learnt under severe interference via our
self-supervised training, and the experimental results confirm this seemingly
paradoxical outcome: the learnt speaker embedding has greater discriminative
power than a standard speaker verification method, while Tune-In consistently
achieves markedly better speech separation performance, in terms of SI-SNRi
and SDRi, than state-of-the-art benchmark systems in all test modes, and at
lower memory and computational cost.
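The cross- and dual-attention mechanism is described only at a high level in the abstract. Below is a minimal sketch of how two spaces derived from a shared feature space might exchange information via cross-attention in both directions; the module layout, dimensions, and residual updates are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dual cross-attention exchange between a
# speaker-knowledge space and a speech-stimuli space, both derived from a
# shared feature space. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Speaker space queries speech features and vice versa, loosely
        # mirroring the bottom-up and top-down directions in the paper.
        self.speaker_from_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_from_speaker = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_spk = nn.LayerNorm(dim)
        self.norm_sph = nn.LayerNorm(dim)

    def forward(self, speaker: torch.Tensor, speech: torch.Tensor):
        # speaker: (B, Ts, D) speaker-knowledge tokens
        # speech:  (B, Tx, D) speech-stimuli tokens
        spk_upd, _ = self.speaker_from_speech(speaker, speech, speech)
        sph_upd, _ = self.speech_from_speaker(speech, speaker, speaker)
        # Residual updates keep each space grounded in its own features.
        return self.norm_spk(speaker + spk_upd), self.norm_sph(speech + sph_upd)

if __name__ == "__main__":
    shared = torch.randn(2, 100, 256)          # shared feature space
    spk, sph = shared[:, :10], shared[:, 10:]  # toy split into two spaces
    spk2, sph2 = DualCrossAttention()(spk, sph)
    print(spk2.shape, sph2.shape)              # (2, 10, 256) and (2, 90, 256)
```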
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD treats the two tasks, acoustic matching and dereverberation, symmetrically, exploiting their reciprocal relationship so that each benefits from its inverse task.
Our framework improves the performance of both the reverberator and the dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification [8.90841350214225]
We introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances.
A contrastive loss function is employed to jointly embed the speech and the acoustic response.
In the test phase, only the reverberant utterance is available, and its embedding is used for the task of room shape classification.
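A minimal sketch of the dual-encoder idea follows, assuming an InfoNCE-style symmetric contrastive objective; the encoder architectures, embedding size, and temperature are placeholders rather than the paper's specification.

```python
# Hypothetical dual-encoder contrastive setup: one encoder for reverberant
# speech, one for the room impulse response (RIR); matching pairs are pulled
# together in a shared embedding space. Architectures are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
rir_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def contrastive_loss(speech_feats, rir_feats, temperature=0.07):
    # Embed both inputs and L2-normalize.
    s = F.normalize(speech_enc(speech_feats), dim=-1)   # (B, 128)
    r = F.normalize(rir_enc(rir_feats), dim=-1)         # (B, 128)
    logits = s @ r.t() / temperature                    # (B, B) similarities
    targets = torch.arange(s.size(0))                   # diagonal = matching pairs
    # Symmetric InfoNCE: speech-to-RIR and RIR-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

At test time, only the speech branch would be run, and its embedding reused for room shape classification.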
arXiv Detail & Related papers (2024-06-05T10:13:55Z)
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and at multiple scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Learning from human perception to improve automatic speaker verification in style-mismatched conditions [21.607777746331998]
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
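The abstract names the loss but not its exact form. A plausible sketch follows, assuming CllrCE combines the standard Cllr calibration cost from speaker verification with a speaker-classification cross-entropy; the weighting and score definitions here are assumptions.

```python
# Hypothetical sketch of a CllrCE-style loss: cross-entropy over speaker
# identities plus the standard Cllr calibration cost on trial scores.
# The exact combination is not given in the abstract; alpha is assumed.
import torch
import torch.nn.functional as F

def cllr(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard Cllr: scores are log-likelihood ratios, labels 1 = target trial."""
    tgt = scores[labels == 1]
    non = scores[labels == 0]
    c_tgt = torch.log1p(torch.exp(-tgt)).mean()
    c_non = torch.log1p(torch.exp(non)).mean()
    return (c_tgt + c_non) / (2.0 * torch.log(torch.tensor(2.0)))

def cllr_ce_loss(logits, spk_ids, trial_scores, trial_labels, alpha=0.5):
    # Cross-entropy captures speaker-specific idiosyncrasies; Cllr penalizes
    # poorly calibrated relative distances between speakers.
    return (alpha * F.cross_entropy(logits, spk_ids)
            + (1 - alpha) * cllr(trial_scores, trial_labels))
```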
arXiv Detail & Related papers (2022-06-28T01:24:38Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
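A minimal sketch of a VQ-VAE-style quantization layer of the kind such a system might use for the latent linguistic embedding; the codebook size, commitment cost, and straight-through gradient trick are standard choices, not necessarily the paper's.

```python
# Minimal vector-quantization layer: each continuous latent vector is
# snapped to its nearest codebook entry. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, T, D) continuous latents from the encoder.
        dists = torch.cdist(z, self.codebook.weight[None])  # (B, T, num_codes)
        idx = dists.argmin(dim=-1)                          # nearest code per frame
        zq = self.codebook(idx)                             # quantized latents
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        return zq, idx, loss
```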
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
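A minimal sketch of cross-modal affinity computation, assuming affinities are normalized dot products between projected audio and visual features; CaffNet's actual architecture is not reproduced here.

```python
# Hypothetical sketch: a frame-wise (locally varying) audio-visual affinity
# matrix plus a pooled global correspondence score. Illustration only.
import torch
import torch.nn.functional as F

def affinities(audio: torch.Tensor, video: torch.Tensor):
    # audio: (B, Ta, D), video: (B, Tv, D), projected into a shared space.
    a = F.normalize(audio, dim=-1)
    v = F.normalize(video, dim=-1)
    local = torch.einsum("btd,bsd->bts", a, v)   # (B, Ta, Tv) frame-wise affinity
    glob = local.mean(dim=(1, 2))                # (B,) global correspondence
    return local, glob
```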
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe that capturing global information with self-attention is crucial for speech separation.
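A minimal sketch of the substitution, assuming a mask-based separation pipeline in which a self-attention encoder stands in for a recurrent stack; layer sizes and the two-speaker mask head are assumptions.

```python
# Sketch of swapping an RNN for a self-attention encoder in mask-based
# separation. Layer sizes and the mask head are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionSeparator(nn.Module):
    def __init__(self, feat_dim: int = 257, d_model: int = 256, n_spk: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # replaces BLSTM
        self.mask_head = nn.Linear(d_model, feat_dim * n_spk)
        self.n_spk, self.feat_dim = n_spk, feat_dim

    def forward(self, spec: torch.Tensor):
        # spec: (B, T, F) mixture magnitude spectrogram.
        h = self.encoder(self.proj(spec))
        masks = torch.sigmoid(self.mask_head(h))
        masks = masks.view(spec.size(0), -1, self.n_spk, self.feat_dim)
        # One mask per speaker, applied to the shared mixture spectrogram.
        return masks * spec.unsqueeze(2)
```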
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
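A minimal sketch of GNN-style refinement, assuming a similarity-weighted aggregation over all segment embeddings in one session; the actual graph construction and GNN variant in the paper may differ.

```python
# Hypothetical sketch: refine per-segment speaker embeddings with one graph
# layer, updating each embedding from its similarity-weighted neighbors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphRefiner(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (N, D) embeddings of all segments in one session.
        sim = F.normalize(emb, dim=-1) @ F.normalize(emb, dim=-1).t()
        adj = torch.softmax(sim / 0.1, dim=-1)    # soft affinity graph
        # Simple GCN-style update: aggregate neighbors, then transform.
        return F.relu(self.lin(adj @ emb)) + emb  # residual keeps originals

refined = GraphRefiner()(torch.randn(50, 128))    # 50 segments, one session
```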
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
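A minimal sketch of direction-informed input features of the kind such a filter could consume, assuming standard inter-channel phase difference (IPD) and cosine directional features; the paper's exact feature set is not reproduced here.

```python
# Hypothetical direction-informed features: inter-channel phase differences
# (IPD) plus a directional feature comparing observed IPDs with those
# expected from the target direction. Common practice, not this paper exactly.
import torch

def ipd(specs: torch.Tensor, ref: int = 0) -> torch.Tensor:
    # specs: (C, T, F) complex STFTs of C microphone channels.
    return torch.angle(specs) - torch.angle(specs[ref : ref + 1])

def directional_feature(specs: torch.Tensor, target_ipd: torch.Tensor) -> torch.Tensor:
    # target_ipd: (C, 1, F) phase differences predicted from the target DOA.
    # High cosine similarity marks time-frequency bins dominated by the target.
    return torch.cos(ipd(specs) - target_ipd).mean(dim=0)   # (T, F)
```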
arXiv Detail & Related papers (2020-01-02T11:12:50Z)