Tune-In: Training Under Negative Environments with Interference for
Attention Networks Simulating Cocktail Party Effect
- URL: http://arxiv.org/abs/2103.01461v1
- Date: Tue, 2 Mar 2021 04:03:37 GMT
- Title: Tune-In: Training Under Negative Environments with Interference for
Attention Networks Simulating Cocktail Party Effect
- Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu
- Abstract summary: We propose a novel attention network called Tune-In, short for training under negative environments with interference.
It first learns two separate spaces of speaker-knowledge and speech-stimuli on top of a shared feature space.
Information is then exchanged between the two spaces via a novel cross- and dual-attention mechanism.
- Score: 37.697375719184926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the cocktail party problem and propose a novel attention network
called Tune-In, short for training under negative environments with
interference. It first learns two separate spaces of speaker-knowledge and
speech-stimuli on top of a shared feature space, with a new block structure
designed as the building block for all spaces, and then solves different tasks
cooperatively. Between the two spaces, information is exchanged in both
directions via a novel cross- and dual-attention mechanism, mimicking the
bottom-up and top-down processes of the human cocktail party effect. It turns
out that substantially discriminative and generalizable speaker
representations can be learnt under severe interference via our
self-supervised training, and the experimental results confirm this seemingly
paradoxical outcome: the learnt speaker embedding has greater discriminative
power than a standard speaker verification method, while Tune-In consistently
achieves markedly better speech separation performance, in terms of SI-SNRi
and SDRi, than state-of-the-art benchmark systems in all test modes, and at
lower memory and computational cost.
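The cross- and dual-attention mechanism is described only at a high level in the abstract. Below is a minimal sketch of how two spaces derived from a shared feature space might exchange information via cross-attention in both directions; the module layout, dimensions, and residual updates are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dual cross-attention exchange between a
# speaker-knowledge space and a speech-stimuli space, both derived from a
# shared feature space. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Speaker space queries speech features and vice versa, loosely
        # mirroring the bottom-up and top-down directions in the paper.
        self.speaker_from_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_from_speaker = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_spk = nn.LayerNorm(dim)
        self.norm_sph = nn.LayerNorm(dim)

    def forward(self, speaker: torch.Tensor, speech: torch.Tensor):
        # speaker: (B, Ts, D) speaker-knowledge tokens
        # speech:  (B, Tx, D) speech-stimuli tokens
        spk_upd, _ = self.speaker_from_speech(speaker, speech, speech)
        sph_upd, _ = self.speech_from_speaker(speech, speaker, speaker)
        # Residual updates keep each space grounded in its own features.
        return self.norm_spk(speaker + spk_upd), self.norm_sph(speech + sph_upd)

if __name__ == "__main__":
    shared = torch.randn(2, 100, 256)          # shared feature space
    spk, sph = shared[:, :10], shared[:, 10:]  # toy split into two spaces
    spk2, sph2 = DualCrossAttention()(spk, sph)
    print(spk2.shape, sph2.shape)              # (2, 10, 256) and (2, 90, 256)
```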
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD treats the two tasks, acoustic matching and dereverberation, symmetrically, exploiting their reciprocal relationship so that each benefits from its inverse task.
Our framework improves the performance of both the reverberator and the dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification [8.90841350214225]
We introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances.
A contrastive loss function is employed to jointly embed the speech and the acoustic response.
In the test phase, only the reverberant utterance is available, and its embedding is used for the task of room shape classification.
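A minimal sketch of the dual-encoder idea follows, assuming an InfoNCE-style symmetric contrastive objective; the encoder architectures, embedding size, and temperature are placeholders rather than the paper's specification.

```python
# Hypothetical dual-encoder contrastive setup: one encoder for reverberant
# speech, one for the room impulse response (RIR); matching pairs are pulled
# together in a shared embedding space. Architectures are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
rir_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def contrastive_loss(speech_feats, rir_feats, temperature=0.07):
    # Embed both inputs and L2-normalize.
    s = F.normalize(speech_enc(speech_feats), dim=-1)   # (B, 128)
    r = F.normalize(rir_enc(rir_feats), dim=-1)         # (B, 128)
    logits = s @ r.t() / temperature                    # (B, B) similarities
    targets = torch.arange(s.size(0))                   # diagonal = matching pairs
    # Symmetric InfoNCE: speech-to-RIR and RIR-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

At test time, only the speech branch would be run, and its embedding reused for room shape classification.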
arXiv Detail & Related papers (2024-06-05T10:13:55Z)
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and at multiple scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Learning from human perception to improve automatic speaker verification in style-mismatched conditions [21.607777746331998]
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
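The abstract names the loss but not its exact form. A plausible sketch follows, assuming CllrCE combines the standard Cllr calibration cost from speaker verification with a speaker-classification cross-entropy; the weighting and score definitions here are assumptions.

```python
# Hypothetical sketch of a CllrCE-style loss: cross-entropy over speaker
# identities plus the standard Cllr calibration cost on trial scores.
# The exact combination is not given in the abstract; alpha is assumed.
import torch
import torch.nn.functional as F

def cllr(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard Cllr: scores are log-likelihood ratios, labels 1 = target trial."""
    tgt = scores[labels == 1]
    non = scores[labels == 0]
    c_tgt = torch.log1p(torch.exp(-tgt)).mean()
    c_non = torch.log1p(torch.exp(non)).mean()
    return (c_tgt + c_non) / (2.0 * torch.log(torch.tensor(2.0)))

def cllr_ce_loss(logits, spk_ids, trial_scores, trial_labels, alpha=0.5):
    # Cross-entropy captures speaker-specific idiosyncrasies; Cllr penalizes
    # poorly calibrated relative distances between speakers.
    return (alpha * F.cross_entropy(logits, spk_ids)
            + (1 - alpha) * cllr(trial_scores, trial_labels))
```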
arXiv Detail & Related papers (2022-06-28T01:24:38Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
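A minimal sketch of a VQ-VAE-style quantization layer of the kind such a system might use for the latent linguistic embedding; the codebook size, commitment cost, and straight-through gradient trick are standard choices, not necessarily the paper's.

```python
# Minimal vector-quantization layer: each continuous latent vector is
# snapped to its nearest codebook entry. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, T, D) continuous latents from the encoder.
        dists = torch.cdist(z, self.codebook.weight[None])  # (B, T, num_codes)
        idx = dists.argmin(dim=-1)                          # nearest code per frame
        zq = self.codebook(idx)                             # quantized latents
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        return zq, idx, loss
```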
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
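A minimal sketch of cross-modal affinity computation, assuming affinities are normalized dot products between projected audio and visual features; CaffNet's actual architecture is not reproduced here.

```python
# Hypothetical sketch: a frame-wise (locally varying) audio-visual affinity
# matrix plus a pooled global correspondence score. Illustration only.
import torch
import torch.nn.functional as F

def affinities(audio: torch.Tensor, video: torch.Tensor):
    # audio: (B, Ta, D), video: (B, Tv, D), projected into a shared space.
    a = F.normalize(audio, dim=-1)
    v = F.normalize(video, dim=-1)
    local = torch.einsum("btd,bsd->bts", a, v)   # (B, Ta, Tv) frame-wise affinity
    glob = local.mean(dim=(1, 2))                # (B,) global correspondence
    return local, glob
```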
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe that capturing global information with self-attention is crucial for speech separation.
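A minimal sketch of the substitution, assuming a mask-based separation pipeline in which a self-attention encoder stands in for a recurrent stack; layer sizes and the two-speaker mask head are assumptions.

```python
# Sketch of swapping an RNN for a self-attention encoder in mask-based
# separation. Layer sizes and the mask head are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionSeparator(nn.Module):
    def __init__(self, feat_dim: int = 257, d_model: int = 256, n_spk: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # replaces BLSTM
        self.mask_head = nn.Linear(d_model, feat_dim * n_spk)
        self.n_spk, self.feat_dim = n_spk, feat_dim

    def forward(self, spec: torch.Tensor):
        # spec: (B, T, F) mixture magnitude spectrogram.
        h = self.encoder(self.proj(spec))
        masks = torch.sigmoid(self.mask_head(h))
        masks = masks.view(spec.size(0), -1, self.n_spk, self.feat_dim)
        # One mask per speaker, applied to the shared mixture spectrogram.
        return masks * spec.unsqueeze(2)
```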
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
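A minimal sketch of GNN-style refinement, assuming a similarity-weighted aggregation over all segment embeddings in one session; the actual graph construction and GNN variant in the paper may differ.

```python
# Hypothetical sketch: refine per-segment speaker embeddings with one graph
# layer, updating each embedding from its similarity-weighted neighbors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphRefiner(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (N, D) embeddings of all segments in one session.
        sim = F.normalize(emb, dim=-1) @ F.normalize(emb, dim=-1).t()
        adj = torch.softmax(sim / 0.1, dim=-1)    # soft affinity graph
        # Simple GCN-style update: aggregate neighbors, then transform.
        return F.relu(self.lin(adj @ emb)) + emb  # residual keeps originals

refined = GraphRefiner()(torch.randn(50, 128))    # 50 segments, one session
```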
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
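A minimal sketch of direction-informed input features of the kind such a filter could consume, assuming standard inter-channel phase difference (IPD) and cosine directional features; the paper's exact feature set is not reproduced here.

```python
# Hypothetical direction-informed features: inter-channel phase differences
# (IPD) plus a directional feature comparing observed IPDs with those
# expected from the target direction. Common practice, not this paper exactly.
import torch

def ipd(specs: torch.Tensor, ref: int = 0) -> torch.Tensor:
    # specs: (C, T, F) complex STFTs of C microphone channels.
    return torch.angle(specs) - torch.angle(specs[ref : ref + 1])

def directional_feature(specs: torch.Tensor, target_ipd: torch.Tensor) -> torch.Tensor:
    # target_ipd: (C, 1, F) phase differences predicted from the target DOA.
    # High cosine similarity marks time-frequency bins dominated by the target.
    return torch.cos(ipd(specs) - target_ipd).mean(dim=0)   # (T, F)
```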
arXiv Detail & Related papers (2020-01-02T11:12:50Z)