Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding
- URL: http://arxiv.org/abs/2507.08031v2
- Date: Mon, 14 Jul 2025 01:10:56 GMT
- Title: Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding
- Authors: Hong Jia, Shiya Fu, Feng Xia, Vassilis Kostakos, Ting Dang,
- Abstract summary: Small Language Models (SLMs) are privacy-preserving alternatives to Large Language Models (LLMs)<n>This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across classification tasks.<n>Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data.
- Score: 12.703061322251093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2\% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30\%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6\%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.
Related papers
- Towards Reasoning Ability of Small Language Models [3.732224317444325]
We show that small language models (SLMs) can achieve competitive reasoning performance.<n>We systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks.<n>Our findings challenge the assumption that scaling is the only way to achieve strong reasoning.
arXiv Detail & Related papers (2025-02-17T08:59:16Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.<n>GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak)
arXiv Detail & Related papers (2024-10-11T03:05:06Z) - 60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can be fine-tuned for the question-answering (QA) task.<n>We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs.<n>Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z) - Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data [3.9459077974367833]
Large language models (LLMs) have demonstrated remarkable success in NLP tasks.
We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM based classifiers (GPT3.5 and GPT4), across 6 text classification tasks.
Our comprehensive experiments demonstrate that employ-ing data augmentation using LLMs (GPT-4) with relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data
arXiv Detail & Related papers (2024-03-27T22:05:10Z) - Mental-LLM: Leveraging Large Language Models for Mental Health
Prediction via Online Text Data [42.965788205842465]
We present a comprehensive evaluation of multiple large language models (LLMs) on various mental health prediction tasks.
We conduct experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning.
Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%.
arXiv Detail & Related papers (2023-07-26T06:00:50Z) - Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.