Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models
- URL: http://arxiv.org/abs/2505.17119v1
- Date: Wed, 21 May 2025 16:30:50 GMT
- Title: Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models
- Authors: Zongru Shao, Xin Wang, Zhanyang Liu, Chenhan Wang, K. P. Subbalakshmi
- Abstract summary: Large language models (LLMs) for early mental health detection, such as depression, are often optimized with machine-generated data. Here, we provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. We then use the models' reasoning abilities to explore mitigation strategies for enhanced performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research leverages large language models (LLMs) for early mental health detection, such as depression, often optimized with machine-generated data. However, such detection may be subject to unknown weaknesses, and beyond limited human verification, no quality control has been applied to these generated corpora. Our goal is to systematically evaluate LLM reasoning and reveal potential weaknesses. To this end, we first provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. Then we use the models' reasoning abilities to explore mitigation strategies for enhanced performance. Specifically, we do the following: A. Design an LLM instruction strategy that enables systematic analysis of the detection by breaking the task down into several subtasks. B. Design contrastive few-shot and chain-of-thought prompts by selecting typical positive and negative examples of detection reasoning. C. Perform human annotation for the subtasks identified in the first step and evaluate the performance. D. Identify human-preferred detections with the desired logical reasoning from the few-shot generations and use them to explore different optimization strategies. We conducted extensive comparisons on the DepTweet dataset across the following subtasks: 1. identifying whether the speaker is describing their own depression; 2. accurately detecting the presence of PHQ-9 symptoms; and 3. detecting depression. Human verification of statistical outliers shows that LLMs are more accurate at analyzing and detecting explicit expressions of depression than implicit ones. Two optimization methods are used to enhance performance and reduce statistical bias: supervised fine-tuning (SFT) and direct preference optimization (DPO). Notably, the DPO approach achieves a significant performance improvement.
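To make the prompting setup concrete, here is a minimal sketch of how a contrastive few-shot, chain-of-thought prompt decomposing detection into the three DepTweet subtasks might be assembled. The prompt wording, PHQ-9 phrasing, and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative sketch only: assembles a contrastive few-shot,
# chain-of-thought prompt that breaks depression detection into the
# three subtasks evaluated on DepTweet. All strings are assumptions.

PHQ9_SYMPTOMS = [
    "little interest or pleasure in doing things",
    "feeling down, depressed, or hopeless",
    "trouble sleeping or sleeping too much",
    "feeling tired or having little energy",
    "poor appetite or overeating",
    "feeling bad about yourself or like a failure",
    "trouble concentrating",
    "moving or speaking noticeably slowly, or being fidgety or restless",
    "thoughts of being better off dead or of self-harm",
]

SUBTASKS = (
    "1. Is the speaker describing their OWN depression (self-reference)?\n"
    "2. Which of the nine PHQ-9 symptoms, if any, are present?\n"
    "3. Given steps 1-2, does the post indicate depression?"
)

def build_prompt(post: str, positive_example: str, negative_example: str) -> str:
    """Assemble one contrastive few-shot chain-of-thought prompt."""
    symptom_list = "\n".join(f"- {s}" for s in PHQ9_SYMPTOMS)
    return (
        "You analyze social-media posts for signs of depression.\n"
        f"PHQ-9 symptoms to check:\n{symptom_list}\n\n"
        f"Reason step by step through these subtasks:\n{SUBTASKS}\n\n"
        f"Example with correct reasoning (depressed):\n{positive_example}\n\n"
        f"Example with correct reasoning (not depressed):\n{negative_example}\n\n"
        f"Now analyze this post:\n{post}\n"
        "Answer each subtask, then give a final label: depressed / not depressed."
    )
```

Human-preferred completions elicited with prompts like this would then supply the preference pairs used in step D.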
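The DPO step can likewise be sketched from the standard DPO objective (Rafailov et al., 2023), assuming pairs of human-preferred (chosen) and dispreferred (rejected) detection reasonings; the function below is the generic loss, not the paper's specific training code.

```python
# Generic DPO objective, shown as a sketch of the preference-optimization
# step: it rewards the policy for raising the likelihood of human-preferred
# reasoning relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_chosen | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_rejected | x) under the policy
    ref_chosen_logps: torch.Tensor,       # same sequences under the frozen reference
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL penalty
) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```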
Related papers
- AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs
Gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases. Recent deep learning-based approaches have consistently improved classification accuracies, but they often lack interpretability. We introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a Large Language Model (LLM) fine-tuned over pairs of motion tokens.
arXiv Detail & Related papers (2025-03-23T17:12:16Z)
- Generating Medically-Informed Explanations for Depression Detection using LLMs
Early detection of depression from social media data offers a valuable opportunity for timely intervention. We propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that combines the power of large language models with the crucial aspect of explainability.
arXiv Detail & Related papers (2025-03-18T19:23:22Z)
- Dementia Insights: A Context-Based MultiModal Approach
Early detection is crucial for timely interventions that may slow disease progression. Large pre-trained models (LPMs) for text and audio have shown promise in identifying cognitive impairments. This study proposes a context-based multimodal method, integrating both text and audio data using the best-performing LPMs.
arXiv Detail & Related papers (2025-03-03T06:46:26Z)
- Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models
Depression is one of the leading causes of disability worldwide. Recent advancements in Large Language Models have shown promise in addressing mental health challenges. We propose a Chain-of-Thought Prompting approach that enhances both the performance and interpretability of depression detection.
arXiv Detail & Related papers (2025-02-09T12:30:57Z)
- On the Within-class Variation Issue in Alzheimer's Disease Detection
Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without. In this work, we found that a sample score estimator can generate sample-specific soft scores that align with cognitive scores. We propose two simple yet effective methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe).
arXiv Detail & Related papers (2024-09-22T02:06:05Z)
- A BERT-Based Summarization approach for depression detection
Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed.
Machine learning and artificial intelligence can autonomously detect depression indicators from diverse data sources.
Our study proposes text summarization as a preprocessing technique to diminish the length and intricacies of input texts.
arXiv Detail & Related papers (2024-09-13T02:14:34Z)
- Resultant: Incremental Effectiveness on Likelihood for Unsupervised Out-of-Distribution Detection
Unsupervised out-of-distribution (U-OOD) detection aims to identify data samples using a detector trained solely on unlabeled in-distribution (ID) data.
Recent studies have developed various detectors based on deep generative models (DGMs) to move beyond likelihood.
We apply two techniques for each direction, specifically post-hoc prior and dataset entropy-mutual calibration.
Experimental results demonstrate that the Resultant could be a new state-of-the-art U-OOD detector.
arXiv Detail & Related papers (2024-09-05T02:58:13Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- MGTBench: Benchmarking Machine-Generated Text Detection
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs).
We show that a larger number of words generally leads to better performance, and that most detection methods can achieve similar performance with far fewer training samples.
Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z)