Deep Neural Networks for Automatic Speech Processing: A Survey from
Large Corpora to Limited Data
- URL: http://arxiv.org/abs/2003.04241v1
- Date: Mon, 9 Mar 2020 16:26:30 GMT
- Title: Deep Neural Networks for Automatic Speech Processing: A Survey from
Large Corpora to Limited Data
- Authors: Vincent Roger, Jérôme Farinas and Julien Pinquier
- Abstract summary: Most state-of-the-art speech systems use Deep Neural Networks (DNNs).
These systems require large amounts of data to be trained.
We position ourselves for the following speech processing tasks: Automatic Speech Recognition, speaker identification and emotion recognition.
- Score: 1.2031796234206138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most state-of-the-art speech systems use Deep Neural Networks (DNNs).
These systems require large amounts of data to be trained. Hence, training
state-of-the-art frameworks on under-resourced speech languages/problems is a
difficult task. One such problem is the limited amount of data available for
impaired speech. Furthermore, acquiring more data and/or expertise is
time-consuming and expensive. In this paper we position ourselves with respect
to the following speech processing tasks: Automatic Speech Recognition, speaker
identification and emotion recognition. To assess the problem of limited data,
we first investigate state-of-the-art Automatic Speech Recognition systems, as
this represents the hardest of these tasks (due to the large variability within
each language).
Next, we provide an overview of techniques and tasks requiring less data. In
the last section we investigate few-shot techniques, as we interpret
under-resourced speech as a few-shot problem. In that sense, we provide an
overview of few-shot techniques and perspectives on using such techniques for
the speech problems covered in this survey. It turns out that the reviewed
techniques are not well adapted to large datasets. Nevertheless, some
promising results from the literature encourage the use of such techniques
for speech processing.
Related papers
- Speech Unlearning [14.755831733659699]
We introduce machine unlearning for speech tasks, a novel and underexplored research problem. It aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. It has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation.
arXiv Detail & Related papers (2025-06-01T06:04:16Z) - ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams [16.172599163455693]
First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language.
Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models.
Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs.
arXiv Detail & Related papers (2024-10-23T14:18:25Z) - Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
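The de-duplication step mentioned above can be illustrated with a minimal sketch (not taken from the cited paper): consecutive repeats of the same discrete unit, which arise because adjacent frames often belong to the same phone, are collapsed into a single token.

```python
# Minimal sketch (hypothetical example, not the cited paper's code):
# run-length de-duplication of a discrete speech unit sequence, one of
# the methods used to compress unit sequences derived from
# self-supervised representations.

def deduplicate(units):
    """Collapse consecutive repeats of the same discrete unit."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Frame-level units repeat because neighbouring frames share a phone.
frames = [7, 7, 7, 12, 12, 5, 5, 5, 5, 12]
print(deduplicate(frames))  # -> [7, 12, 5, 12]
```

Subword modeling (e.g. BPE over the de-duplicated units) would then shorten the sequence further by merging frequent unit n-grams into single tokens.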
arXiv Detail & Related papers (2023-09-27T17:21:13Z) - Panning for gold: Lessons learned from the platform-agnostic automated
detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
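Of the three categories above, the contrastive one can be sketched with a toy InfoNCE-style objective (a generic illustration, not any specific system from the review): the model is trained so that a positive target representation scores higher than a set of distractor negatives.

```python
import math

# Hypothetical minimal sketch of a contrastive (InfoNCE-style) objective:
# the loss is the negative log softmax probability of the positive
# candidate among positive + negative candidates.

def info_nce(pos_score, neg_scores):
    """Return -log softmax probability of the positive among all scores."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))

# The loss shrinks as the positive outranks the negatives more clearly.
low = info_nce(5.0, [0.1, -0.3, 0.2])
high = info_nce(0.0, [0.1, -0.3, 0.2])
print(low < high)  # -> True
```

In a real system the scores would be similarities between learned representations (e.g. a context vector and candidate future frames), but the ranking intuition is the same.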
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Automated Audio Captioning: an Overview of Recent Progress and New
Challenges [56.98522404673527]
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - Data Augmentation for Speech Recognition in Maltese: A Low-Resource
Perspective [4.6898263272139795]
We consider data augmentation techniques for improving speech recognition in Maltese.
We consider three types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data.
Our results show that combining the three data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
arXiv Detail & Related papers (2021-11-15T14:28:21Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z) - Automatic Speech Summarisation: A Scoping Review [7.755991028607979]
This scoping review maps the speech summarisation literature with no restrictions on time frame, language summarised, research method, or paper type.
We reviewed a total of 110 papers out of a set of 153 found through a literature search, and extracted the speech features used, methods, scope, and training corpora.
arXiv Detail & Related papers (2020-08-27T03:15:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.