Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms
- URL: http://arxiv.org/abs/2310.07161v2
- Date: Tue, 21 Nov 2023 07:54:34 GMT
- Title: Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms
- Authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Shuo Han, Yunyang
Zeng, Ankit Shah, Bhiksha Raj
- Abstract summary: This research is rooted in the exploration of proprietary sender-side denoising effects.
A methodological novelty is introduced via the Oaxaca decomposition.
Psychoacoustic metrics, specifically PESQ and STOI, are harnessed to furnish a comprehensive understanding of speech alterations.
- Score: 20.081363744228753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within the ambit of VoIP (Voice over Internet Protocol) telecommunications,
the complexities introduced by acoustic transformations merit rigorous
analysis. This research, rooted in the exploration of proprietary sender-side
denoising effects, meticulously evaluates platforms such as Google Meet and
Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset,
ensuring a structured examination tailored to various denoising settings and
receiver interfaces. A methodological novelty is introduced via the Oaxaca
decomposition, traditionally an econometric tool, repurposed herein to analyze
acoustic-phonetic perturbations within VoIP systems. To further ground the
implications of these transformations, psychoacoustic metrics, specifically
PESQ and STOI, were harnessed to furnish a comprehensive understanding of
speech alterations. Cumulatively, the insights garnered underscore the
intricate landscape of VoIP-influenced acoustic dynamics. In addition to the
primary findings, a multitude of metrics are reported, extending the research
purview. Moreover, out-of-domain benchmarking for both time and time-frequency
domain speech enhancement models is included, thereby enhancing the depth and
applicability of this inquiry. Repository:
github.com/deepology/VoIP-DNS-Challenge
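As a concrete illustration of the two quantitative ingredients named in the abstract, the sketch below computes PESQ and STOI for a processed utterance against its clean reference and then applies a two-fold Blinder-Oaxaca decomposition to the metric gap between two VoIP conditions. This is a minimal Python sketch under stated assumptions, not the released VoIP-DNS-Challenge code: it assumes the third-party `pesq` and `pystoi` packages, 16 kHz mono audio, and placeholder acoustic-phonetic feature matrices.

```python
# Minimal sketch: perceptual metrics plus a two-fold Blinder-Oaxaca decomposition.
# Paths, feature matrices, and variable names below are illustrative placeholders.
import numpy as np
import soundfile as sf
from pesq import pesq      # pip install pesq    (ITU-T P.862 wrapper)
from pystoi import stoi    # pip install pystoi  (short-time objective intelligibility)

def perceptual_scores(clean_path, processed_path, sr=16000):
    """Return (PESQ, STOI) for a processed utterance against its clean reference."""
    clean, _ = sf.read(clean_path)
    processed, _ = sf.read(processed_path)
    n = min(len(clean), len(processed))              # crude length alignment
    clean, processed = clean[:n], processed[:n]
    return pesq(sr, clean, processed, 'wb'), stoi(clean, processed, sr)

def oaxaca_two_fold(X_a, y_a, X_b, y_b):
    """Two-fold Blinder-Oaxaca decomposition of mean(y_a) - mean(y_b).

    X_a, X_b: (n, d) acoustic-phonetic feature matrices for conditions A and B.
    y_a, y_b: (n,) perceptual scores (e.g., PESQ) under each condition.
    Returns (explained, unexplained): the part of the gap attributable to
    feature (endowment) differences and the part attributable to coefficient
    differences, i.e., the platform's processing.
    """
    def fit(X, y):                                    # OLS with intercept
        Z = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return beta
    beta_a, beta_b = fit(X_a, y_a), fit(X_b, y_b)
    mean_a = np.concatenate([[1.0], X_a.mean(axis=0)])
    mean_b = np.concatenate([[1.0], X_b.mean(axis=0)])
    explained = (mean_a - mean_b) @ beta_b            # gap from feature differences
    unexplained = mean_a @ (beta_a - beta_b)          # gap from coefficient differences
    return explained, unexplained

# Hypothetical usage:
# p, s = perceptual_scores("clean/fileid_0.wav", "platform_denoised/fileid_0.wav")
# explained, unexplained = oaxaca_two_fold(X_meet, y_meet, X_zoom, y_zoom)
```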
Related papers
- Probing the Information Encoded in Neural-based Acoustic Models of
Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine what information is encoded in an automatic speech recognition acoustic model (AM) and where it is located.
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection systems and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
arXiv Detail & Related papers (2024-02-29T18:43:53Z)
- Neural Acoustic Context Field: Rendering Realistic Room Impulse Response
With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene.
Driven by the unique properties of RIR, we design a temporal correlation module and multi-scale energy decay criterion.
Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake
Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Synthetic Voice Detection and Audio Splicing Detection using
SE-Res2Net-Conformer Architecture [2.9805017559176883]
This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features.
Experimental results on ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture is able to improve the spoofing countermeasures performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
arXiv Detail & Related papers (2022-10-07T14:30:13Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Improving Perceptual Quality by Phone-Fortified Perceptual Loss using
Wasserstein Distance for Speech Enhancement [23.933935913913043]
We propose a phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models.
To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model.
Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses.
arXiv Detail & Related papers (2020-10-28T18:34:28Z)
- Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech
Deep Features in Adversarial Networks [29.821666380496637]
HiFi-GAN transforms recorded speech to sound as though it had been recorded in a studio.
It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
arXiv Detail & Related papers (2020-06-10T07:24:39Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)