A knowledge-driven vowel-based approach of depression classification
from speech using data augmentation
- URL: http://arxiv.org/abs/2210.15261v1
- Date: Thu, 27 Oct 2022 08:34:08 GMT
- Title: A knowledge-driven vowel-based approach of depression classification
from speech using data augmentation
- Authors: Kexin Feng and Theodora Chaspari
- Abstract summary: We propose a novel explainable machine learning (ML) model that identifies depression from speech.
Our method first models the variable-length utterances at the local level into a fixed-size vowel-based embedding.
Depression is then classified at the global level from a group of vowel CNN embeddings that serve as the input of another 1D CNN.
- Score: 10.961439164833891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel explainable machine learning (ML) model that identifies
depression from speech, by modeling the temporal dependencies across utterances
and utilizing the spectrotemporal information at the vowel level. Our method
first models the variable-length utterances at the local level into a
fixed-size vowel-based embedding using a convolutional neural network with a
spatial pyramid pooling layer ("vowel CNN"). Following that, depression is
classified at the global level from a group of vowel CNN embeddings that serve
as the input of another 1D CNN ("depression CNN"). Different data augmentation
methods are designed for both the training of vowel CNN and depression CNN. We
investigate the performance of the proposed system at various temporal
granularities when modeling short, medium, and long analysis windows,
corresponding to 10, 21, and 42 utterances, respectively. The proposed method
reaches comparable performance with previous state-of-the-art approaches and
depicts explainable properties with respect to the depression outcome. The
findings from this work may benefit clinicians by providing additional
intuitions during joint human-ML decision-making tasks.
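The two-stage pipeline above hinges on spatial pyramid pooling, which maps a variable-length utterance to a fixed-size embedding before the downstream 1D CNN. The following is a minimal NumPy sketch of that pooling step only; the function name, pyramid levels, and channel counts are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    """Max-pool a (channels, time) feature map at several pyramid levels,
    concatenating the results into a vector whose size is independent of
    the input's time length."""
    channels, T = features.shape
    pooled = []
    for bins in levels:
        # Split the time axis into `bins` roughly equal segments.
        edges = np.linspace(0, T, bins + 1).astype(int)
        for b in range(bins):
            lo = edges[b]
            hi = max(lo + 1, edges[b + 1])  # guard against empty segments
            pooled.append(features[:, lo:hi].max(axis=1))
    return np.concatenate(pooled)  # shape: (channels * sum(levels),)

# Utterances of different lengths map to embeddings of the same size,
# which is what lets a single downstream 1D CNN consume them.
short = np.random.randn(8, 30)   # 8 channels, 30 frames
long_ = np.random.randn(8, 113)  # 8 channels, 113 frames
assert spatial_pyramid_pool(short).shape == (8 * 7,)
assert spatial_pyramid_pool(long_).shape == (8 * 7,)
```

In a real system this vector would feed the "depression CNN" over a window of utterances; the sketch only shows why the embedding size is fixed.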
Related papers
- HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can leverage their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Toward Knowledge-Driven Speech-Based Models of Depression: Leveraging
Spectrotemporal Variations in Speech Vowels [10.961439164833891]
Psychomotor retardation associated with depression has been linked with tangible differences in vowel production.
This paper investigates a knowledge-driven machine learning (ML) method that integrates spectrotemporal information of speech at the vowel level to identify depression.
arXiv Detail & Related papers (2022-10-05T19:57:53Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - Multimodal Depression Classification Using Articulatory Coordination
Features And Hierarchical Attention Based Text Embeddings [4.050982413149992]
We develop a multimodal depression classification system using articulatory coordination features extracted from vocal tract variables and text transcriptions.
The system is developed by combining embeddings from the session-level audio model and the HAN text model.
arXiv Detail & Related papers (2022-02-13T07:37:09Z) - Keypoint Message Passing for Video-based Person Re-Identification [106.41022426556776]
Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras.
Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement.
In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph.
arXiv Detail & Related papers (2021-11-16T08:01:16Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - Interpreting intermediate convolutional layers of CNNs trained on raw
speech [0.0]
We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data.
The proposed technique enables acoustic analysis of intermediate convolutional layers.
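The averaging step this summary describes is simple enough to sketch. Below is a hedged NumPy illustration, assuming the layer's activations come as a (channels, time) array; the function name and shapes are our assumptions, not the paper's API.

```python
import numpy as np

def layer_timeseries(feature_maps):
    """Apply ReLU to a convolutional layer's (channels, time) activations,
    then average over the channel axis to obtain one interpretable
    time series for the input signal."""
    activated = np.maximum(feature_maps, 0.0)  # ReLU
    return activated.mean(axis=0)              # (channels, time) -> (time,)

# A toy activation tensor yields a time series of matching length.
acts = np.random.randn(16, 200)   # 16 feature maps, 200 time steps
series = layer_timeseries(acts)
assert series.shape == (200,)
assert (series >= 0).all()  # averaging ReLU outputs is non-negative
```

The resulting series can then be analyzed with standard acoustic tools (e.g. compared against the input waveform's envelope), which is what makes the intermediate layers inspectable.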
arXiv Detail & Related papers (2021-04-19T17:52:06Z) - Multi-Modal Detection of Alzheimer's Disease from Speech and Text [3.702631194466718]
We propose a deep learning method that utilizes speech and the corresponding transcript simultaneously to detect Alzheimer's disease (AD)
The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the DementiaBank Pitt corpus.
arXiv Detail & Related papers (2020-11-30T21:18:17Z) - Correlation based Multi-phasal models for improved imagined speech EEG
recognition [22.196642357767338]
This work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units.
A bi-phase common representation learning module using neural networks is designed to model the correlation between an analysis phase and a support phase.
The proposed approach further handles the non-availability of multi-phasal data during decoding.
arXiv Detail & Related papers (2020-11-04T09:39:53Z) - Video-based Facial Expression Recognition using Graph Convolutional
Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
arXiv Detail & Related papers (2020-10-26T07:31:51Z) - Mechanisms for Handling Nested Dependencies in Neural-Network Language
Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.