Exploration of Audio Quality Assessment and Anomaly Localisation Using
Attention Models
- URL: http://arxiv.org/abs/2005.08053v1
- Date: Sat, 16 May 2020 17:54:07 GMT
- Title: Exploration of Audio Quality Assessment and Anomaly Localisation Using
Attention Models
- Authors: Qiang Huang and Thomas Hain
- Abstract summary: In this paper, a novel model for audio quality assessment is proposed by jointly using bidirectional long short-term memory and an attention mechanism.
The former mimics human auditory perception to learn information from a recording, and the latter further discriminates interference from the desired signal by highlighting target-related features.
To evaluate our proposed approach, the TIMIT dataset is used and augmented by mixing with various natural sounds.
- Score: 37.60722440434528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many applications of speech technology require more and more audio data.
Automatic assessment of the quality of the collected recordings is important to
ensure they meet the requirements of the related applications. However,
effective and high-performing assessment remains a challenging task without a
clean reference. In this paper, a novel model for audio quality assessment is
proposed by jointly using bidirectional long short-term memory and an attention
mechanism. The former mimics human auditory perception to learn information
from a recording, and the latter further discriminates interference from the
desired signal by highlighting target-related features. To
evaluate our proposed approach, the TIMIT dataset is used and augmented by
mixing with various natural sounds. In our experiments, two tasks are explored.
The first task is to predict an utterance quality score, and the second is to
identify where an anomalous distortion takes place in a recording. The obtained
results show that our proposed approach outperforms a strong baseline method,
with improvements of about 5% on three metrics: Linear Correlation Coefficient,
Spearman Rank Correlation Coefficient, and F1.
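As a rough illustration of the architecture described in the abstract, the sketch below combines a bidirectional LSTM encoder with an additive attention layer: the attention weights pool the frame sequence into a single utterance-level quality score (the first task) and can be read per frame as a saliency signal for anomaly localisation (the second task). This is a minimal PyTorch sketch; the layer sizes, feature dimensionality, and attention form are assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionQuality(nn.Module):
    """Hedged sketch: BiLSTM encoder + additive attention for utterance-level
    quality scoring and frame-level localisation. Hyperparameters are assumed,
    not taken from the paper."""

    def __init__(self, feat_dim=40, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Additive attention over the BiLSTM outputs.
        self.attn_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.regressor = nn.Linear(2 * hidden_dim, 1)

    def forward(self, feats):
        # feats: (batch, time, feat_dim), e.g. log-mel or MFCC frames
        enc, _ = self.encoder(feats)                                  # (B, T, 2H)
        energies = self.attn_score(torch.tanh(self.attn_proj(enc)))   # (B, T, 1)
        weights = torch.softmax(energies, dim=1)                      # frame-level attention
        context = (weights * enc).sum(dim=1)                          # (B, 2H)
        quality = self.regressor(context).squeeze(-1)                 # utterance quality score
        # The attention weights can be thresholded to localise anomalous frames.
        return quality, weights.squeeze(-1)

# Example usage on a dummy batch of 3-second utterances (100 frames/s assumed).
model = BiLSTMAttentionQuality()
dummy = torch.randn(4, 300, 40)
score, frame_weights = model(dummy)
print(score.shape, frame_weights.shape)  # torch.Size([4]) torch.Size([4, 300])
```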
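The evaluation data are built by mixing TIMIT utterances with various natural sounds. A generic way to perform such mixing at a controlled signal-to-noise ratio is sketched below; the noise corpus, SNR values, and placement of the distortion used in the paper are not specified here, so all numbers are placeholders.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a clean utterance with a noise clip at a target SNR in dB.
    Generic augmentation sketch; the paper's exact mixing protocol is assumed."""
    # Loop or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: mix a TIMIT-like utterance with a natural sound at 5 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000 * 3)   # 3 s of 16 kHz placeholder speech
noise = rng.standard_normal(16000)        # 1 s placeholder natural sound
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```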
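The reported gains are measured with Linear Correlation Coefficient, Spearman Rank Correlation Coefficient, and F1. A minimal sketch of computing these with SciPy and scikit-learn follows; the score arrays, frame labels, and the threshold that turns attention weights into binary anomaly predictions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score

# Task 1: utterance-level quality prediction (placeholder arrays).
predicted_scores = np.array([3.2, 4.1, 2.7, 4.8])
reference_scores = np.array([3.0, 4.5, 2.5, 4.6])

lcc, _ = pearsonr(predicted_scores, reference_scores)    # Linear Correlation Coefficient
srcc, _ = spearmanr(predicted_scores, reference_scores)  # Spearman Rank Correlation Coefficient

# Task 2: frame-level anomaly localisation scored as binary classification.
# Frame labels and attention-derived predictions are illustrative only.
frame_labels = np.array([0, 0, 1, 1, 1, 0, 0, 0])
frame_weights = np.array([0.02, 0.05, 0.40, 0.35, 0.30, 0.08, 0.04, 0.03])
frame_preds = (frame_weights > 0.2).astype(int)          # assumed threshold
f1 = f1_score(frame_labels, frame_preds)

print(f"LCC={lcc:.3f}  SRCC={srcc:.3f}  F1={f1:.3f}")
```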
Related papers
- Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context [7.567181073057191]
This paper introduces a novel approach where the system learns at the audio level instead of at the segment level, despite data scarcity.
It shows that the ASR-based Wav2Vec2 model brings the best results, which may indicate a strong correlation between ASR and speech quality assessment.
arXiv Detail & Related papers (2024-03-29T13:59:34Z)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice [18.8222742272435]
The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing.
The results reveal that our approach performs similarly to state-of-the-art (SOTA) methods and outperforms the latent representations obtained from popular pre-trained audio models.
arXiv Detail & Related papers (2023-11-27T07:19:22Z)
- Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection [69.53626024091076]
Alzheimer's disease (AD) is particularly prominent in older adults.
Recent advances in pre-trained models motivate AD detection modeling to shift from low-level features to high-level representations.
This paper presents several efficient methods to extract better AD-related cues from high-level acoustic and linguistic features.
arXiv Detail & Related papers (2023-03-14T16:03:28Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Objective hearing threshold identification from auditory brainstem response measurements using supervised and self-supervised approaches [1.0627340704073347]
We develop and compare two methods for automated hearing threshold identification from averaged ABR raw data.
We show that both models work well, outperform human threshold detection, and are suitable for fast, reliable, and unbiased hearing threshold detection and quality control.
arXiv Detail & Related papers (2021-12-16T15:24:31Z)
- Comparing Acoustic-based Approaches for Alzheimer's Disease Detection [8.360862198568967]
We study the performance and generalizability of three approaches for AD detection from speech on the recent ADReSSo challenge dataset.
We find that while feature-based approaches have higher precision, classification approaches relying on a combination of embeddings and features achieve higher and more balanced performance across multiple metrics.
arXiv Detail & Related papers (2021-06-03T02:44:40Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines that do not use them in most of the acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.