Disentangling the Impacts of Language and Channel Variability on Speech
Separation Networks
- URL: http://arxiv.org/abs/2203.16040v1
- Date: Wed, 30 Mar 2022 04:07:23 GMT
- Title: Disentangling the Impacts of Language and Channel Variability on Speech
Separation Networks
- Authors: Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
- Abstract summary: Domain mismatch between training/test situations due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation.
In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels.
- Score: 25.662237869109433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Because the performance of speech separation is excellent for speech in which
two speakers completely overlap, research attention has shifted to dealing
with more realistic scenarios. However, domain mismatch between training/test
situations due to factors such as speaker, content, channel, and environment
remains a severe problem for speech separation. Speaker and environment
mismatches have been studied in the existing literature. Nevertheless, there
are few studies on speech content and channel mismatches. Moreover, the impacts
of language and channel in these studies are mostly tangled. In this study, we
create several datasets for various experiments. The results show that the
impacts of different languages are small enough to be ignored compared to the
impacts of different channels. In our experiments, training on data recorded by
Android phones leads to the best generalizability. Moreover, we provide a new
solution for channel mismatch by evaluating projection, whereby channel
similarity can be measured and used to effectively select additional training
data that improves performance on in-the-wild test data.
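The abstract does not spell out how this projection is computed. As a rough, non-authoritative sketch, the snippet below assumes that channel character can be summarized by an utterance's long-term average log-magnitude spectrum and that similarity is the normalized projection (cosine) between two such profiles; the function names and the selection threshold are illustrative, not the authors' implementation.

```python
import numpy as np

def long_term_spectrum(wav: np.ndarray, n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """Long-term average log-magnitude spectrum: a crude summary of channel character."""
    frames = np.lib.stride_tricks.sliding_window_view(wav, n_fft)[::hop]
    mags = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    return np.log(mags.mean(axis=0) + 1e-8)

def channel_projection_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized projection (cosine) between two channel profiles, in [-1, 1]."""
    sa, sb = long_term_spectrum(a), long_term_spectrum(b)
    sa, sb = sa - sa.mean(), sb - sb.mean()
    return float(np.dot(sa, sb) / (np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-8))

def select_training_utterances(candidates, references, threshold=0.8):
    """Keep candidates whose channel profile resembles any in-the-wild reference.

    The threshold is a hypothetical knob; the paper's actual selection
    criterion may differ.
    """
    return [c for c in candidates
            if max(channel_projection_similarity(c, r) for r in references) >= threshold]
```

The appeal of such a measure is that it needs no labels: candidate training sets can be ranked against a handful of in-the-wild recordings before any model is retrained.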
Related papers
- Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions.
To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions.
Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z) - Multi-Stage Speaker Diarization for Noisy Classrooms [1.4549461207028445]
This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline.
We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models.
We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions.
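A minimal sketch of such a hybrid scheme, assuming an OR-style fusion rule (the frame rate, threshold, and function name are illustrative assumptions, not NeMo's actual implementation):

```python
import numpy as np

def hybrid_vad(frame_probs: np.ndarray,
               word_spans: list[tuple[float, float]],
               frame_rate: float = 100.0,
               vad_threshold: float = 0.5) -> np.ndarray:
    """Fuse frame-level VAD posteriors with ASR word-level timestamps.

    A frame counts as speech if its VAD posterior exceeds the threshold OR it
    falls inside any ASR-decoded word span (given in seconds). The OR rule is
    an assumed strategy for recovering speech frames the VAD misses in noise.
    """
    speech = frame_probs > vad_threshold
    for start, end in word_spans:
        lo, hi = int(start * frame_rate), int(np.ceil(end * frame_rate))
        speech[lo:hi] = True
    return speech
```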
arXiv Detail & Related papers (2025-05-16T05:35:06Z) - Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language [6.200058263544999]
This study focuses on detecting bilingual hate speech in YouTube comments.
We include factors such as content similarity, definition similarity, and common hate words to measure the impact of datasets on performance.
The best performance was obtained by combining datasets from YouTube comments, Twitter, and Gab, with F1-scores of 0.74 and 0.68 for English and German YouTube comments, respectively.
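As a toy illustration of one such factor, content similarity between two corpora could be computed as the cosine between their mean TF-IDF vectors; this operationalization is an assumption, and the paper may define the measure differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dataset_content_similarity(corpus_a: list[str], corpus_b: list[str]) -> float:
    """Cosine similarity between the mean TF-IDF vectors of two corpora."""
    vec = TfidfVectorizer(min_df=2)            # shared vocabulary over both corpora
    X = vec.fit_transform(corpus_a + corpus_b)
    mean_a = np.asarray(X[:len(corpus_a)].mean(axis=0))
    mean_b = np.asarray(X[len(corpus_a):].mean(axis=0))
    return float(cosine_similarity(mean_a, mean_b)[0, 0])
```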
arXiv Detail & Related papers (2024-10-02T10:22:53Z) - Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association [24.843733099049015]
This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge.
It focuses on a contrastive learning-based chaining-cluster method to enhance face-voice association.
We conducted extensive experiments to investigate the impact of language on face-voice association.
The results demonstrate the superior performance of our method and validate the robustness and effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-04T13:24:36Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection [9.166963162285064]
This study investigates the effectiveness and adaptability of pre-trained and fine-tuned Large Language Models (LLMs) in identifying hate speech.
LLMs offer a huge advantage over the state-of-the-art even without pretraining.
arXiv Detail & Related papers (2023-10-29T10:07:32Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video.
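One plausible reading of this objective is an InfoNCE-style loss in which a clip's original and dubbed audio tracks are both positives for its video embedding; the two-positive formulation below is an assumed sketch, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def counterfactual_nce(video_emb: torch.Tensor,
                       audio_emb: torch.Tensor,
                       dubbed_emb: torch.Tensor,
                       tau: float = 0.07) -> torch.Tensor:
    """InfoNCE where each video has two positives: its original and dubbed audio.

    All inputs are (B, D) embeddings; other clips in the batch act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    d = F.normalize(dubbed_emb, dim=-1)
    targets = torch.arange(v.size(0), device=v.device)
    loss_a = F.cross_entropy(v @ a.t() / tau, targets)  # video vs. original audio
    loss_d = F.cross_entropy(v @ d.t() / tau, targets)  # video vs. dubbed audio
    return 0.5 * (loss_a + loss_d)
```

Because the two tracks differ only in speech, pulling both toward the same video plausibly encourages the audio encoder to focus on speech-invariant content.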
arXiv Detail & Related papers (2023-04-12T04:17:45Z) - Multi-Dimensional and Multi-Scale Modeling for Speech Separation
Optimized by Discriminative Learning [9.84949849886926]
This paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z) - Improving Distortion Robustness of Self-supervised Speech Processing
Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
Enhancing the robustness of speech processing models is essential for good performance when they encounter speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z) - Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z) - FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation residual convolutional neural network, which facilitates the learning of strong spectral frame-level representations (a generic version of such a block is sketched below).
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
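Squeeze-and-Excitation residual blocks follow a well-known pattern; the generic PyTorch version below is a sketch (channel count, kernel size, and reduction ratio are assumptions, not FluentNet's exact configuration).

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Generic Squeeze-and-Excitation residual block (hyperparameters assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # global spectral-temporal context
        self.excite = nn.Sequential(             # per-channel gating weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.body(x)
        b, c, _, _ = y.shape
        w = self.excite(self.squeeze(y).view(b, c)).view(b, c, 1, 1)
        return torch.relu(x + y * w)              # gate, then residual connection
```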
arXiv Detail & Related papers (2020-09-23T21:51:29Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.