A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework
- URL: http://arxiv.org/abs/2409.15357v1
- Date: Tue, 17 Sep 2024 05:45:33 GMT
- Title: A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework
- Authors: Zheng Nan, Ting Dang, Vidhyasaharan Sethu, Beena Ahmed,
- Abstract summary: Despite the crucial role relational thinking plays in human understanding of speech, it has yet to be leveraged in any artificial speech recognition systems.
This paper presents a novel spectro-temporal relational thinking based acoustic modeling framework.
Models built upon this framework outperform stateof-the-art systems with a 7.82% improvement in phoneme recognition tasks over the TIMIT dataset.
- Score: 10.354955365036181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relational thinking refers to the inherent ability of humans to form mental impressions about relations between sensory signals and prior knowledge, and subsequently incorporate them into their model of their world. Despite the crucial role relational thinking plays in human understanding of speech, it has yet to be leveraged in any artificial speech recognition systems. Recently, there have been some attempts to correct this oversight, but these have been limited to coarse utterance-level models that operate exclusively in the time domain. In an attempt to narrow the gap between artificial systems and human abilities, this paper presents a novel spectro-temporal relational thinking based acoustic modeling framework. Specifically, it first generates numerous probabilistic graphs to model the relationships among speech segments across both time and frequency domains. The relational information rooted in every pair of nodes within these graphs is then aggregated and embedded into latent representations that can be utilized by downstream tasks. Models built upon this framework outperform state-of-the-art systems with a 7.82\% improvement in phoneme recognition tasks over the TIMIT dataset. In-depth analyses further reveal that our proposed relational thinking modeling mainly improves the model's ability to recognize vowels, which are the most likely to be confused by phoneme recognizers.
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - Evaluating Speaker Identity Coding in Self-supervised Models and Humans [0.42303492200814446]
Speaker identity plays a significant role in human communication and is being increasingly used in societal applications.
We show that self-supervised representations from different families are significantly better for speaker identification over acoustic representations.
We also show that such a speaker identification task can be used to better understand the nature of acoustic information representation in different layers of these powerful networks.
arXiv Detail & Related papers (2024-06-14T20:07:21Z) - Capturing Spectral and Long-term Contextual Information for Speech
Emotion Recognition Using Deep Learning Techniques [0.0]
This research proposes an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals.
By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches.
Results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
arXiv Detail & Related papers (2023-08-04T06:20:42Z) - Relational Temporal Graph Reasoning for Dual-task Dialogue Language
Understanding [39.76268402567324]
Dual-task dialog understanding language aims to tackle two correlative dialog language understanding tasks simultaneously via their inherent correlations.
We put forward a new framework, whose core is relational temporal graph reasoning.
Our models outperform state-of-the-art models by a large margin.
arXiv Detail & Related papers (2023-06-15T13:19:08Z) - Probing self-supervised speech models for phonetic and phonemic
information: a case study in aspiration [17.94683764469626]
We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
arXiv Detail & Related papers (2023-06-09T20:07:22Z) - Message Intercommunication for Inductive Relation Reasoning [49.731293143079455]
We develop a novel inductive relation reasoning model called MINES.
We introduce a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph.
Our experiments show that MINES outperforms existing state-of-the-art models.
arXiv Detail & Related papers (2023-05-23T13:51:46Z) - Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z) - Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z) - On the benefits of robust models in modulation recognition [53.391095789289736]
Deep Neural Networks (DNNs) using convolutional layers are state-of-the-art in many tasks in communications.
In other domains, like image classification, DNNs have been shown to be vulnerable to adversarial perturbations.
We propose a novel framework to test the robustness of current state-of-the-art models.
arXiv Detail & Related papers (2021-03-27T19:58:06Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who cannot longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z) - Deep Graph Random Process for Relational-Thinking-Based Speech
Recognition [12.09786458466155]
relational thinking is characterized by relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge.
We present a Bayesian nonparametric deep learning method called deep graph random process (DGP) that can generate an infinite number of probabilistic graphs representing percepts.
Our approach is able to successfully infer relations among utterances without using any relational data during training.
arXiv Detail & Related papers (2020-07-04T15:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.