VANPY: Voice Analysis Framework
- URL: http://arxiv.org/abs/2502.17579v1
- Date: Mon, 17 Feb 2025 21:12:57 GMT
- Title: VANPY: Voice Analysis Framework
- Authors: Gregory Koushnir, Michael Fire, Galit Fuhrmann Alpert, Dima Kagan
- Abstract summary: We develop the VANPY framework for automated pre-processing, feature extraction, and classification of voice data. Four of the framework's components were developed in-house and integrated into the framework to extend speaker characterization capabilities. We demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction."
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Voice data is increasingly being used in modern digital communications, yet there is still a lack of comprehensive tools for automated voice analysis and characterization. To this end, we developed the VANPY (Voice Analysis in Python) framework for automated pre-processing, feature extraction, and classification of voice data. The VANPY is an open-source end-to-end comprehensive framework that was developed for the purpose of speaker characterization from voice data. The framework is designed with extensibility in mind, allowing for easy integration of new components and adaptation to various voice analysis applications. It currently incorporates over fifteen voice analysis components - including music/speech separation, voice activity detection, speaker embedding, vocal feature extraction, and various classification models. Four of the VANPY's components were developed in-house and integrated into the framework to extend its speaker characterization capabilities: gender classification, emotion classification, age regression, and height regression. The models demonstrate robust performance across various datasets, although not surpassing state-of-the-art performance. As a proof of concept, we demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction." The results illustrate the framework's capability to extract multiple speaker characteristics, including gender, age, height, emotion type, and emotion intensity measured across three dimensions: arousal, dominance, and valence.
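The abstract describes a framework built from interchangeable components (separation, voice activity detection, embedding, feature extraction, classification) that can be extended with new ones. The following is a minimal sketch of that general idea only. It is not the VANPY API: `Pipeline`, `mean_abs_amplitude`, `loudness_label`, the record layout, and the 0.5 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical illustration only; none of these names come from VANPY.
Record = Dict[str, object]
Component = Callable[[Record], Record]


@dataclass
class Pipeline:
    """Runs components in order; each one adds new keys to a shared record."""
    components: List[Component] = field(default_factory=list)

    def add(self, component: Component) -> "Pipeline":
        # Extending the pipeline means appending one more callable.
        self.components.append(component)
        return self

    def run(self, record: Record) -> Record:
        out = dict(record)  # copy so the caller's input is not mutated
        for component in self.components:
            out.update(component(out))
        return out


def mean_abs_amplitude(record: Record) -> Record:
    """Toy 'feature extraction' step: mean absolute sample value."""
    samples = record["samples"]
    return {"mean_abs_amplitude": sum(abs(s) for s in samples) / len(samples)}


def loudness_label(record: Record) -> Record:
    """Toy 'classification' step that consumes the feature above.

    The 0.5 threshold is arbitrary and exists only for this example.
    """
    label = "loud" if record["mean_abs_amplitude"] > 0.5 else "quiet"
    return {"loudness": label}


pipeline = Pipeline().add(mean_abs_amplitude).add(loudness_label)
result = pipeline.run({"samples": [0.9, -0.8, 0.7, -0.6]})
```

In this sketch a later component reads what an earlier one wrote, so adding a new analysis step requires no change to the existing ones. Whether VANPY structures its components this way is not stated in the abstract.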
Related papers
- Automatic Estimation of Singing Voice Musical Dynamics [9.343063100314687]
We propose a methodology for dataset curation.
We compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files.
We train a CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics.
We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction.
arXiv Detail & Related papers (2024-10-27T18:15:18Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Disentangling Textual and Acoustic Features of Neural Speech Representations [23.486891834252535]
We build upon the Information Bottleneck principle to propose a disentanglement framework for complex speech representations.
We apply our framework to emotion recognition and speaker identification downstream tasks.
arXiv Detail & Related papers (2024-10-03T22:48:04Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Residual Information in Deep Speaker Embedding Architectures [4.619541348328938]
This paper introduces an analysis over six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
arXiv Detail & Related papers (2023-02-06T12:37:57Z)
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations [12.139222986297263]
This paper goes beyond voice identity and presents a neural architecture that allows the manipulation of voice attributes.
A novel structured neural network is proposed in which multiple auto-encoders are used to encode speech as a set of idealistically independent linguistic and extra-linguistic representations.
The proposed architecture is time-synchronized so that the original voice timing is preserved during conversion which allows lip-sync applications.
arXiv Detail & Related papers (2021-07-26T17:40:43Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Data-driven Detection and Analysis of the Patterns of Creaky Voice [13.829936505895692]
Creaky voice is a quality frequently used as a phrase-boundary marker.
The automatic detection and modelling of creaky voice may have implications for speech technology applications.
arXiv Detail & Related papers (2020-05-31T13:34:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.