Shennong: a Python toolbox for audio speech features extraction
- URL: http://arxiv.org/abs/2112.05555v1
- Date: Fri, 10 Dec 2021 14:08:52 GMT
- Title: Shennong: a Python toolbox for audio speech features extraction
- Authors: Mathieu Bernard, Maxime Poli, Julien Karadayi and Emmanuel Dupoux
- Abstract summary: Shennong is a Python toolbox and command-line utility for speech feature extraction.
It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters, pre-trained neural networks, pitch estimators and speaker normalization methods.
This paper illustrates its use on three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
- Score: 15.816237141746562
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Shennong, a Python toolbox and command-line utility for speech
feature extraction. It implements a wide range of well-established state-of-the-art
algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral
Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch
estimators, as well as speaker normalization methods and post-processing
algorithms. Shennong is an open-source, easy-to-use, reliable and extensible
framework. The use of Python makes integration with other speech modeling and
machine learning tools easy. It aims to replace or complement several
heterogeneous software tools, such as Kaldi or Praat. After describing the Shennong
software architecture, its core components and the implemented algorithms, this
paper illustrates its use on three applications: a comparison of speech feature
performance on a phone discrimination task, an analysis of a Vocal Tract Length
Normalization model as a function of the speech duration used for training, and
a comparison of pitch estimation algorithms under various noise conditions.
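For illustration, the snippet below sketches the extraction pipeline the abstract describes: load an audio file, run a spectro-temporal processor, then apply a post-processor. It follows the processor/post-processor pattern of the toolbox, but the exact module paths and class names (e.g. MfccProcessor, CmvnPostProcessor) are assumptions that may differ between Shennong versions; check the project documentation before relying on them.

```python
# Minimal sketch of a Shennong feature extraction pipeline.
# NOTE: module paths and class names below are assumptions based on the
# paper's description of processors and post-processors; they may differ
# between Shennong releases, so consult the official documentation.
from shennong.audio import Audio
from shennong.processor.mfcc import MfccProcessor
from shennong.postprocessor.cmvn import CmvnPostProcessor

# Load a waveform ('speech.wav' is a placeholder path)
audio = Audio.load('speech.wav')

# Extract MFCC features with a spectro-temporal processor
processor = MfccProcessor(sample_rate=audio.sample_rate)
mfcc = processor.process(audio)

# Post-process with cepstral mean-variance normalization (CMVN)
cmvn = CmvnPostProcessor(mfcc.ndims)
cmvn.accumulate(mfcc)
normalized = cmvn.process(mfcc)
```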
Related papers
- Prak: An automatic phonetic alignment tool for Czech [0.0]
A free, open-source tool that generates phone sequences from Czech text and time-aligns them with audio.
Its Czech pronunciation generator is composed of simple rule-based blocks capturing the logic of the language.
arXiv Detail & Related papers (2023-04-17T16:51:24Z)
- DeepFry: Identifying Vocal Fry Using Deep Neural Networks [16.489251286870704]
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch.
Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems.
This paper proposes a deep learning model to detect creaky voice in fluent speech.
arXiv Detail & Related papers (2022-03-31T13:23:24Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
With non-streaming models, it achieves word error rates similar to previous self-supervised learning work.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
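As a rough illustration of the random-projection quantizer idea above, the toy sketch below maps feature frames to discrete labels with a fixed random projection and a fixed random codebook; the dimensions, distance metric and lack of normalization are illustrative choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, code_dim, codebook_size = 80, 16, 1024  # illustrative sizes

# Both the projection and the codebook stay fixed (never trained)
projection = rng.normal(size=(feat_dim, code_dim))
codebook = rng.normal(size=(codebook_size, code_dim))

def quantize(frames):
    """Map speech frames (n_frames, feat_dim) to discrete labels (n_frames,)."""
    projected = frames @ projection
    # label = index of the nearest codebook vector (Euclidean distance)
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

labels = quantize(rng.normal(size=(100, feat_dim)))  # e.g. 100 frames
```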
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- QuaPy: A Python-Based Framework for Quantification [76.22817970624875]
QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation).
It is written in Python and can be installed via pip.
arXiv Detail & Related papers (2021-06-18T13:57:11Z)
- SpeechBrain: A General-Purpose Speech Toolkit [73.0404642815335]
SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
arXiv Detail & Related papers (2021-06-08T18:22:56Z)
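As a hedged usage sketch of the SpeechBrain toolkit above, the snippet below transcribes a file with a pretrained model; the import path and model identifier reflect its documented pretrained-pipeline interface, but these have moved between releases (newer versions use speechbrain.inference), so treat this as indicative.

```python
# Sketch of transcribing a file with a pretrained SpeechBrain model.
# The import path and model id depend on the installed release;
# newer versions expose the same classes under speechbrain.inference.
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("speech.wav"))  # placeholder audio path
```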
- Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora [36.292020779233056]
This work presents an alternative approach that learns weights for multiple sentence-level features.
We apply this technique to building Neural Machine Translation (NMT) systems using the Paracrawl corpus for Estonian-English.
We analyze the sensitivity of this method to different types of noise and explore if the learned weights generalize to other language pairs.
arXiv Detail & Related papers (2021-03-11T21:45:45Z)
- WaDeNet: Wavelet Decomposition based CNN for Speech Processing [0.0]
WaDeNet is an end-to-end model for mobile speech processing.
WaDeNet embeds wavelet decomposition of the speech signal within the architecture.
arXiv Detail & Related papers (2020-11-11T06:43:03Z)
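To make the wavelet decomposition step of WaDeNet concrete, the sketch below uses PyWavelets to split a signal into multi-resolution bands, roughly the kind of decomposition the model embeds; the wavelet family ('db4') and depth are arbitrary illustrative choices, not the paper's.

```python
import numpy as np
import pywt

# Toy multi-level wavelet decomposition of a speech signal;
# 'db4' and level=3 are illustrative, not WaDeNet's actual choices.
signal = np.random.randn(16000)  # stand-in for 1 s of 16 kHz audio
coeffs = pywt.wavedec(signal, 'db4', level=3)

# coeffs = [approximation, detail level 3, detail level 2, detail level 1]
for i, band in enumerate(coeffs):
    print(f'band {i}: {band.shape[0]} coefficients')
```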
- Language Through a Prism: A Spectral Approach for Multiscale Language Representations [30.224517199646993]
We show that signal processing provides a natural framework for separating structure across scales.
We apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part-of-speech tagging.
We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales.
arXiv Detail & Related papers (2020-11-09T23:17:43Z)
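The sketch below illustrates the general idea of the spectral filtering described for the prism approach: take a layer's activations across input positions, keep only one band of DCT components, and invert; the band edges and the choice of DCT are illustrative, not the paper's exact recipe.

```python
import numpy as np
from scipy.fft import dct, idct

def band_filter(activations, lo, hi):
    """Keep only DCT components in [lo, hi) along the sequence axis.

    activations: (seq_len, hidden_dim) activations of one layer.
    """
    coeffs = dct(activations, axis=0, norm='ortho')
    mask = np.zeros_like(coeffs)
    mask[lo:hi, :] = 1.0  # retain one frequency band, zero the rest
    return idct(coeffs * mask, axis=0, norm='ortho')

# e.g. keep only slowly varying (low-frequency) structure
filtered = band_filter(np.random.randn(128, 768), lo=0, hi=8)
```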
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
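As a toy illustration of the anchor-plus-sparse-transform idea above, each word embedding below is a sparse nonnegative combination of a few anchor embeddings; the sizes and random coefficients stand in for quantities that are learned in the paper, and the coefficient matrix is stored densely here only for brevity (a sparse matrix is what makes the memory savings real).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_anchors, dim = 10_000, 100, 256  # illustrative sizes

# Small set of dense anchor embeddings (learned in the paper)
anchors = rng.normal(size=(num_anchors, dim))

# Sparse nonnegative transformation: each word mixes a few anchors
coeffs = np.zeros((vocab_size, num_anchors))
for word in range(vocab_size):
    picked = rng.choice(num_anchors, size=4, replace=False)
    coeffs[word, picked] = rng.random(4)

def embed(word_id):
    """Reconstruct a word's embedding from its sparse anchor mixture."""
    return coeffs[word_id] @ anchors

vector = embed(42)  # (dim,) embedding without a full vocab-by-dim table
```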
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.