When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection
- URL: http://arxiv.org/abs/2402.13276v2
- Date: Mon, 23 Sep 2024 22:54:05 GMT
- Title: When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection
- Authors: Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps
- Abstract summary: Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods.
Large Language Models (LLMs) stand out for their versatility in mental healthcare applications.
We present an innovative approach to integrating acoustic speech information into the LLM framework for multimodal depression detection.
- Score: 17.018248242646365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, Large Language Models (LLMs) stand out for their versatility in mental healthcare applications. However, their primary limitation arises from their exclusive dependence on textual input, which constrains their overall capabilities. Furthermore, the use of LLMs for identifying and analyzing depressive states remains relatively untapped. In this paper, we present an innovative approach to integrating acoustic speech information into the LLM framework for multimodal depression detection. We investigate an efficient method for depression detection that integrates speech signals into LLMs using acoustic landmarks. Because acoustic landmarks are specific to the pronunciation of spoken words, they add critical dimensions to text transcripts. This integration also provides insight into individuals' unique speech patterns, revealing their potential mental states. Evaluations of the proposed approach on the DAIC-WOZ dataset yield state-of-the-art results compared with existing Audio-Text baselines. Beyond depression detection, this approach also represents a new perspective on enhancing the ability of LLMs to comprehend and process speech signals.
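The abstract describes serializing acoustic landmarks alongside the transcript but does not spell out the encoding. Below is a minimal, hypothetical sketch of one such integration: a handful of landmark symbols (the g/b/s inventory, polarity signs, and prompt template are all illustrative assumptions, not the paper's actual scheme) are appended as text so a text-only LLM can condition on both modalities.

```python
# Hypothetical sketch: serialize landmark symbols into the prompt of a
# text-only LLM. The symbol inventory and template below are illustrative
# assumptions, not the scheme used in the paper.

LANDMARK_GLOSS = {
    "g+": "glottal vibration onset", "g-": "glottal vibration offset",
    "b+": "burst onset",             "b-": "burst offset",
    "s+": "syllabic onset",          "s-": "syllabic offset",
}

def build_prompt(transcript: str, landmarks: list[str]) -> str:
    """Append a serialized landmark sequence to the transcript so the
    model can see pronunciation-level cues next to the words."""
    unknown = [lm for lm in landmarks if lm not in LANDMARK_GLOSS]
    if unknown:
        raise ValueError(f"unrecognized landmark symbols: {unknown}")
    return (
        f"Transcript: {transcript}\n"
        f"Acoustic landmarks: {' '.join(landmarks)}\n"
        "Question: Does the speaker show signs of depression? Answer yes or no."
    )

print(build_prompt("I haven't been sleeping much lately.",
                   ["g+", "s+", "b+", "b-", "s-", "g-"]))
```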
Related papers
- Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Large language models (LLMs) are increasingly extended to integrate speech and audio data.
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z)
- Language-Agnostic Analysis of Speech Depression Detection [2.5764071253486636]
This work analyzes automatic speech-based depression detection across two languages, English and Malayalam.
A CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages.
Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems.
arXiv Detail & Related papers (2024-09-23T07:35:56Z)
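The entry above mentions a CNN trained on depression-related acoustic features; a minimal sketch of such a classifier in PyTorch might look like the following (the 40-dim filterbank input, layer sizes, and binary head are assumptions for illustration, not the paper's architecture).

```python
# A minimal 1-D CNN over frame-level acoustic features (e.g., 40-dim
# filterbanks). Layer sizes are illustrative only.
import torch
import torch.nn as nn

class DepressionCNN(nn.Module):
    def __init__(self, n_features: int = 40, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> utterance embedding
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_frames)
        return self.head(self.conv(x).squeeze(-1))

model = DepressionCNN()
logits = model(torch.randn(8, 40, 300))  # 8 utterances, 300 frames each
print(logits.shape)  # torch.Size([8, 2])
```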
- Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? [26.835947209927273]
Large Language Models (LLMs) can rewrite text while transferring style attributes such as formality, producing formal or informal variants.
We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.
Our approach yielded a 40% relative improvement in human speech perception by paraphrasing utterances that are highly distorted under babble noise at a signal-to-noise ratio (SNR) of -5 dB.
arXiv Detail & Related papers (2024-08-07T18:24:23Z)
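The -5 dB condition in the paraphrasing study above corresponds to noise power roughly 3.16 times the speech power (10^(5/10)). A small NumPy sketch of mixing at a target SNR, with placeholder signals standing in for real speech and babble:

```python
# Mix speech with noise at a target SNR. At -5 dB the scaled noise power
# is about 3.16x the speech power, matching the listening condition above.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # placeholder tone
babble = rng.standard_normal(16000)                          # placeholder noise
mixed = mix_at_snr(speech, babble, snr_db=-5.0)
```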
- Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component of human communication, indicating the speaker's intention and implication beyond the literal text of a dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z)
- Speech-based Clinical Depression Screening: An Empirical Study [32.84863235794086]
This study investigates the utility of speech signals for AI-based depression screening across varied interaction scenarios.
Participants include depressed patients recruited from the outpatient clinics of Peking University Sixth Hospital.
We extracted acoustic and deep speech features from each participant's segmented recordings.
arXiv Detail & Related papers (2024-06-05T09:43:54Z)
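The clinical screening study above extracts acoustic features from segmented recordings; its exact feature set is not given in the summary, but per-segment MFCC statistics are a common choice. A brief sketch with librosa (the 5-second segmentation and 13-coefficient setting are assumptions):

```python
# Per-segment MFCC means as simple acoustic features; the study's actual
# feature set is not specified in the summary above.
import numpy as np
import librosa

def segment_features(path: str, seg_seconds: float = 5.0) -> list[np.ndarray]:
    y, sr = librosa.load(path, sr=16000)
    hop = int(seg_seconds * sr)
    segments = [y[i:i + hop] for i in range(0, len(y), hop)]
    # One 13-dim vector per segment; skip fragments shorter than an FFT window.
    return [librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
            for s in segments if len(s) >= 2048]
```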
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
This work overcomes the limitation that GER operates on text alone by infusing acoustic information before generating the predicted transcription, via a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
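UADF's exact weighting rule is not given in the summary above, but the general idea of uncertainty-aware late fusion can be illustrated by combining two token distributions and down-weighting whichever source is more uncertain (higher entropy):

```python
# Illustrative late fusion of an LLM token distribution with an acoustic
# (ASR) token distribution, weighted by each source's entropy. This conveys
# the general idea only, not UADF's actual rule.
import numpy as np

def entropy(p: np.ndarray) -> float:
    return float(-np.sum(p * np.log(p + 1e-12)))

def fuse(p_llm: np.ndarray, p_asr: np.ndarray) -> np.ndarray:
    h_llm, h_asr = entropy(p_llm), entropy(p_asr)
    # The lower-entropy (more confident) source receives more weight.
    w_llm = h_asr / (h_llm + h_asr)
    fused = w_llm * p_llm + (1.0 - w_llm) * p_asr
    return fused / fused.sum()

p_llm = np.array([0.70, 0.20, 0.10])   # LLM fairly confident
p_asr = np.array([0.40, 0.35, 0.25])   # acoustic model less so
print(fuse(p_llm, p_asr))
```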
- Identification of Cognitive Decline from Spoken Language through Feature Selection and the Bag of Acoustic Words Model [0.0]
The early identification of symptoms of memory disorders plays a significant role in ensuring the well-being of populations.
The lack of standardized speech tests in clinical settings has led to a growing emphasis on developing automatic machine learning techniques for analyzing naturally spoken language.
The work presents a feature-selection approach that automatically selects, from the Geneva minimalistic acoustic parameter set (eGeMAPS) and relative speech pauses, the features essential for diagnosis.
arXiv Detail & Related papers (2024-02-02T17:06:03Z)
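For the cognitive-decline study above, the summary names the Geneva minimalistic acoustic parameter set as the feature pool. A hedged sketch of automatic feature selection with scikit-learn (univariate F-tests and k=10 are illustrative stand-ins for the paper's actual procedure; the data here is a random placeholder):

```python
# Selecting the most diagnostic features from a precomputed acoustic
# feature matrix (e.g., eGeMAPS functionals plus pause statistics).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 88))   # 60 speakers x 88 eGeMAPS functionals
y = rng.integers(0, 2, size=60)     # placeholder diagnosis labels

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_small = selector.transform(X)
print(X_small.shape)                                 # (60, 10)
print(np.flatnonzero(selector.get_support()))        # indices of kept features
```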
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
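The method above is reference-free and uncertainty-based; as a baseline illustration of that family (not the paper's focus-enhanced method), one can score a generation by the mean negative log-probability of its tokens and flag low-confidence outputs:

```python
# Baseline reference-free uncertainty score: mean negative log-probability
# of the generated tokens. Higher scores flag likely hallucinations.
def uncertainty_score(token_logprobs: list[float]) -> float:
    return -sum(token_logprobs) / len(token_logprobs)

confident = [-0.05, -0.10, -0.02, -0.08]  # model was sure of each token
guessing = [-1.60, -2.30, -0.90, -3.00]   # model was uncertain
print(uncertainty_score(confident))  # ~0.06 -> likely grounded
print(uncertainty_score(guessing))   # ~1.95 -> candidate hallucination
```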
- DEPAC: a Corpus for Depression and Anxiety Detection from Speech [3.2154432166999465]
We introduce DEPAC, a novel mental-distress audio dataset labeled using established thresholds on depression and anxiety screening tools.
This large dataset comprises multiple speech tasks per individual, as well as relevant demographic information.
We present a feature set consisting of hand-curated acoustic and linguistic features, which were found effective in identifying signs of mental illnesses in human speech.
arXiv Detail & Related papers (2023-06-20T12:21:06Z)
- Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection [69.53626024091076]
Alzheimer's disease (AD) is particularly prominent in older adults.
Recent advances in pre-trained models motivate AD detection modeling to shift from low-level features to high-level representations.
This paper presents several efficient methods to extract better AD-related cues from high-level acoustic and linguistic features.
arXiv Detail & Related papers (2023-03-14T16:03:28Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)