Multimodal Depression Classification Using Articulatory Coordination
Features And Hierarchical Attention Based Text Embeddings
- URL: http://arxiv.org/abs/2202.06238v1
- Date: Sun, 13 Feb 2022 07:37:09 GMT
- Title: Multimodal Depression Classification Using Articulatory Coordination
Features And Hierarchical Attention Based Text Embeddings
- Authors: Nadee Seneviratne, Carol Espy-Wilson
- Abstract summary: We develop a multimodal depression classification system using articulatory coordination features extracted from vocal tract variables and text transcriptions.
The system is developed by combining embeddings from the session-level audio model and the HAN text model.
- Score: 4.050982413149992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal depression classification has gained immense popularity over the
recent years. We develop a multimodal depression classification system using
articulatory coordination features extracted from vocal tract variables and
text transcriptions obtained from an automatic speech recognition tool. The
system yields improvements in area under the receiver operating characteristic
curve over the uni-modal classifiers (7.5% over the audio classifier and 13.7%
over the text classifier). We show that, when training data are limited, a
segment-level classifier can be trained first and its outputs aggregated into a
session-wise prediction using a multi-stage convolutional recurrent neural
network, without hindering performance. A text model is trained using a
Hierarchical Attention Network (HAN). The multimodal system is developed by
combining embeddings from the session-level audio model and the HAN text model.
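For concreteness, the following is a minimal PyTorch-style sketch of the late-fusion step described in the abstract: a session-level audio embedding is concatenated with a HAN text embedding and passed through a small feed-forward classifier. The dimensions, layer sizes, and names (AUDIO_DIM, TEXT_DIM, FusionClassifier) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): late fusion of a session-level audio
# embedding (e.g. from a multi-stage convolutional recurrent network over
# articulatory coordination features) with a document embedding from a
# Hierarchical Attention Network (HAN). All dimensions are illustrative.
AUDIO_DIM = 256   # assumed size of the session-level audio embedding
TEXT_DIM = 200    # assumed size of the HAN text embedding

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=AUDIO_DIM, text_dim=TEXT_DIM, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1),   # single logit: depressed vs. not depressed
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb: (batch, AUDIO_DIM), text_emb: (batch, TEXT_DIM)
        return self.fuse(torch.cat([audio_emb, text_emb], dim=-1))

# Toy usage with random stand-in embeddings for one batch of sessions.
model = FusionClassifier()
audio_emb = torch.randn(4, AUDIO_DIM)
text_emb = torch.randn(4, TEXT_DIM)
probs = torch.sigmoid(model(audio_emb, text_emb))  # per-session probability
```

In the paper the audio embedding itself comes from aggregating segment-level information into a session-level representation; in this sketch that stage is abstracted into a precomputed vector.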
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
arXiv Detail & Related papers (2023-05-03T02:30:46Z)
- A knowledge-driven vowel-based approach of depression classification from speech using data augmentation [10.961439164833891]
We propose a novel explainable machine learning (ML) model that identifies depression from speech.
Our method first models the variable-length utterances at the local level as fixed-size vowel-based embeddings.
Depression is then classified at the global level from the group of vowel CNN embeddings, which serve as the input to another 1D CNN (a minimal sketch of this two-stage pipeline appears after this list).
arXiv Detail & Related papers (2022-10-27T08:34:08Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Multi-Dialect Arabic Speech Recognition [0.0]
This paper presents the design and development of multi-dialect automatic speech recognition for Arabic.
Deep neural networks are becoming an effective tool to solve sequential data problems.
The proposed system achieved a 14% error rate, outperforming previous systems.
arXiv Detail & Related papers (2021-12-25T20:55:57Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
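As referenced in the vowel-based depression classification entry above, the following is a hypothetical sketch of that two-stage idea: fixed-length vowel patches are encoded by a small CNN into fixed-size embeddings, and a 1D CNN over the resulting embedding sequence produces a recording-level depression decision. All feature sizes, kernel sizes, and class names (VowelEncoder, GlobalClassifier) are invented for illustration and do not come from that paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a two-stage vowel-based pipeline: local vowel patches
# -> fixed-size CNN embeddings -> a 1D CNN over the embedding sequence for a
# global (recording-level) decision. All sizes are illustrative guesses.
N_MELS = 40        # assumed spectrogram features per frame
EMB_DIM = 64       # assumed size of each vowel-segment embedding

class VowelEncoder(nn.Module):
    """Maps one fixed-length vowel spectrogram patch to an embedding."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(16 * 4 * 4, EMB_DIM)

    def forward(self, patches):            # (num_vowels, 1, N_MELS, frames)
        h = self.conv(patches).flatten(1)
        return self.proj(h)                # (num_vowels, EMB_DIM)

class GlobalClassifier(nn.Module):
    """1D CNN over the sequence of vowel embeddings from one recording."""
    def __init__(self):
        super().__init__()
        self.conv1d = nn.Sequential(
            nn.Conv1d(EMB_DIM, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.out = nn.Linear(32, 1)        # logit: depressed vs. not depressed

    def forward(self, vowel_embs):         # (batch, num_vowels, EMB_DIM)
        h = self.conv1d(vowel_embs.transpose(1, 2)).squeeze(-1)
        return self.out(h)

# Toy usage: 30 vowel patches of 20 frames each from one recording.
enc, clf = VowelEncoder(), GlobalClassifier()
patches = torch.randn(30, 1, N_MELS, 20)
embs = enc(patches).unsqueeze(0)           # (1, 30, EMB_DIM)
prob = torch.sigmoid(clf(embs))            # recording-level probability
```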