Evaluating the reliability of acoustic speech embeddings
- URL: http://arxiv.org/abs/2007.13542v2
- Date: Fri, 6 Nov 2020 13:08:49 GMT
- Title: Evaluating the reliability of acoustic speech embeddings
- Authors: Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux
- Abstract summary: Speech embeddings are fixed-size acoustic representations of variable-length speech sequences.
Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods.
We find that overall, ABX and MAP correlate with one another and with frequency estimation.
- Score: 10.5754802112615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech embeddings are fixed-size acoustic representations of variable-length
speech sequences. They are increasingly used for a variety of tasks ranging
from information retrieval to unsupervised term discovery and speech
segmentation. However, there is currently no clear methodology to compare or
optimise the quality of these embeddings in a task-neutral way. Here, we
systematically compare two popular metrics, ABX discrimination and Mean Average
Precision (MAP), on 5 languages across 17 embedding methods, ranging from
supervised to fully unsupervised, and using different loss functions
(autoencoders, correspondence autoencoders, siamese). Then we use the ABX and
MAP to predict performances on a new downstream task: the unsupervised
estimation of the frequencies of speech segments in a given corpus. We find
that overall, ABX and MAP correlate with one another and with frequency
estimation. However, substantial discrepancies appear in the fine-grained
distinctions across languages and/or embedding methods. This makes it
unrealistic at present to propose a task-independent silver bullet method for
computing the intrinsic quality of speech embeddings. There is a need for more
detailed analysis of the metrics currently used to evaluate such embeddings.
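The two intrinsic metrics compared in the abstract can be sketched compactly. The snippet below is a minimal illustration under assumed conventions, not the paper's evaluation code: MAP is computed as same-label retrieval over cosine distance between fixed-size embeddings, and the ABX score as the fraction of (A, B, X) triplets in which X (same category as A) lies closer to A than to B. All function names and toy data are hypothetical.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def mean_average_precision(embeddings, labels):
    """MAP over same-label retrieval: each embedding queries all the
    others, ranked by cosine distance; precision is averaged at the
    rank of every same-label item, then averaged over queries."""
    n = len(embeddings)
    ap_scores = []
    for q in range(n):
        dists = [cosine_dist(embeddings[q], embeddings[i]) for i in range(n) if i != q]
        rel = [labels[i] == labels[q] for i in range(n) if i != q]
        order = np.argsort(dists)
        hits, precisions = 0, []
        for rank, idx in enumerate(order, start=1):
            if rel[idx]:
                hits += 1
                precisions.append(hits / rank)
        if precisions:
            ap_scores.append(float(np.mean(precisions)))
    return float(np.mean(ap_scores))

def abx_score(triplets):
    """ABX discrimination: fraction of (A, B, X) triplets in which X,
    drawn from the same category as A, is closer to A than to B."""
    correct = sum(cosine_dist(a, x) < cosine_dist(b, x) for a, b, x in triplets)
    return correct / len(triplets)
```

In practice both metrics are far more involved (e.g. DTW over frame sequences, within/across-speaker conditions for ABX), but this shows why the two can correlate while diverging in fine-grained comparisons: they aggregate the same distances in different ways.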
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines.
arXiv Detail & Related papers (2024-02-08T11:31:23Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation-F0.5 score by 9.8% over baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.