VANPY: Voice Analysis Framework
- URL: http://arxiv.org/abs/2502.17579v2
- Date: Sun, 04 May 2025 19:01:26 GMT
- Title: VANPY: Voice Analysis Framework
- Authors: Gregory Koushnir, Michael Fire, Galit Fuhrmann Alpert, Dima Kagan
- Abstract summary: We develop the VANPY framework for automated pre-processing, feature extraction, and classification of voice data. Four of the framework's components were developed in-house and integrated into the framework to extend speaker characterization capabilities. We demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction".
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Voice data is increasingly used in modern digital communications, yet comprehensive tools for automated voice analysis and characterization are still lacking. To this end, we developed VANPY (Voice Analysis in Python), an open-source, end-to-end framework for automated pre-processing, feature extraction, and classification of voice data, aimed at speaker characterization. The framework is designed with extensibility in mind, allowing easy integration of new components and adaptation to various voice analysis applications. It currently incorporates over fifteen voice analysis components, including music/speech separation, voice activity detection, speaker embedding, vocal feature extraction, and various classification models. Four of VANPY's components were developed in-house and integrated into the framework to extend its speaker characterization capabilities: gender classification, emotion classification, age regression, and height regression. These models demonstrate robust performance across various datasets, although they do not surpass the state of the art. As a proof of concept, we demonstrate the framework's ability to extract speaker characteristics in a use-case challenge: analyzing character voices from the movie "Pulp Fiction." The results illustrate the framework's capability to extract multiple speaker characteristics, including gender, age, height, emotion type, and emotion intensity measured across three dimensions: arousal, dominance, and valence.
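For orientation, below is a minimal, hypothetical sketch of the kind of pipeline the framework automates (silence trimming standing in for voice activity detection, vocal feature extraction, then downstream characterization). It uses librosa calls for illustration only and does not reflect VANPY's actual API or component names.

```python
# Illustrative only: NOT the VANPY API. Sketches the stages the abstract
# describes: crude VAD -> vocal feature extraction -> speaker characterization.
import numpy as np
import librosa

def extract_vocal_features(wav_path, sr=16000):
    """Load a clip and compute a small vocal feature vector."""
    y, _ = librosa.load(wav_path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)              # crude stand-in for VAD
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # spectral envelope
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)    # pitch track
    return np.concatenate([mfcc.mean(axis=1),
                           [np.nanmean(f0), np.nanstd(f0)]])

# Downstream characterization: any pre-trained classifier or regressor
# (gender, emotion, age, height) can consume such a feature vector, e.g.:
# gender = gender_clf.predict([extract_vocal_features("clip.wav")])
```

In VANPY itself, these stages are implemented as separate, composable components, which is what the extensibility claim in the abstract refers to.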
Related papers
- Marco-Voice Technical Report [35.01600797874603]
The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset.
arXiv Detail & Related papers (2025-08-04T04:08:22Z) - SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [52.29009595100625]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z) - AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences.
arXiv Detail & Related papers (2025-07-17T00:39:18Z) - MultiVox: Benchmarking Voice Assistants for Multimodal Interactions [43.55740197419447]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z) - Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis [20.80178325643714]
In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns.
arXiv Detail & Related papers (2025-07-02T22:16:42Z) - VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models [26.34810950257782]
We propose VocalBench, a benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. Experimental results reveal significant variability across current models, with each exhibiting distinct strengths and weaknesses.
arXiv Detail & Related papers (2025-05-21T16:34:07Z) - Automatic Estimation of Singing Voice Musical Dynamics [9.343063100314687]
We propose a methodology for dataset curation.
We compile a dataset of 509 singing voice performances annotated with musical dynamics, aligned with 163 score files.
We train a CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics.
We conclude from our experiments that bark-scale-based features outperform log-Mel features for the task of singing voice dynamics prediction.
arXiv Detail & Related papers (2024-10-27T18:15:18Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Disentangling Textual and Acoustic Features of Neural Speech Representations [23.486891834252535]
We build upon the Information Bottleneck principle to propose a disentanglement framework for complex speech representations.
We apply our framework to emotion recognition and speaker identification downstream tasks.
arXiv Detail & Related papers (2024-10-03T22:48:04Z) - Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z) - Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively (an EER computation sketch follows this list).
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - Residual Information in Deep Speaker Embedding Architectures [4.619541348328938]
This paper introduces an analysis over six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
arXiv Detail & Related papers (2023-02-06T12:37:57Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations [12.139222986297263]
This paper goes beyond voice identity and presents a neural architecture that allows the manipulation of voice attributes.
A novel structured neural network is proposed in which multiple auto-encoders are used to encode speech as a set of ideally independent linguistic and extra-linguistic representations.
The proposed architecture is time-synchronized so that the original voice timing is preserved during conversion which allows lip-sync applications.
arXiv Detail & Related papers (2021-07-26T17:40:43Z) - FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z) - Data-driven Detection and Analysis of the Patterns of Creaky Voice [13.829936505895692]
Creaky voice is a quality frequently used as a phrase-boundary marker.
The automatic detection and modelling of creaky voice may have implications for speech technology applications.
arXiv Detail & Related papers (2020-05-31T13:34:30Z)
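As a reference for the EER figures cited in the speaker verification entry above (Disentangling Voice and Content with Self-Supervision for Speaker Recognition), here is a minimal sketch of how the equal error rate is typically computed from verification trial scores; the labels and scores below are toy values, not results from any listed paper.

```python
# Toy example: computing equal error rate (EER) from verification scores.
# EER is the operating point where false-accept and false-reject rates meet.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)   # false-accept vs. true-accept rates
    fnr = 1 - tpr                             # false-reject rate
    idx = np.nanargmin(np.abs(fnr - fpr))     # closest crossing of the two curves
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 1, 0, 0, 0])         # 1 = same-speaker trial
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2])
print(f"EER = {compute_eer(labels, scores):.3f}")
```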
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.