Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
- URL: http://arxiv.org/abs/2501.00617v1
- Date: Tue, 31 Dec 2024 19:32:25 GMT
- Title: Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
- Authors: Tomek Rutowski, Amir Harati, Elizabeth Shriberg, Yang Lu, Piotr Chlebek, Ricardo Oliveira,
- Abstract summary: This study illustrates how variations in test and train set sizes impact performance in a controlled study. Results show that test sizes below 1K samples gave noisy results, even for larger training set sizes. Training set sizes of at least 2K were needed for stable results.
- Score: 7.6109649792432315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
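The fully crossed train/test size design described in the abstract can be sketched as follows. This is an illustrative mock-up, not the authors' code: the synthetic features, the nearest-class-mean scorer, and the specific size grid are all assumptions chosen to show how repeating each train/test size combination exposes evaluation noise, with a rank-based AUC computed from scratch.

```python
import numpy as np

def auc(scores, labels):
    # Rank-based AUC (Mann-Whitney U statistic), no external ML library needed.
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)

# Synthetic stand-in for a large labeled corpus (assumption: real features
# would come from NLP or acoustic front-ends, as in the study).
N, D = 20_000, 16
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
y = (X @ w + rng.normal(scale=2.0, size=N) > 0).astype(int)

def fit_predict(X_tr, y_tr, X_te):
    # Nearest-class-mean direction as a placeholder classifier; any model
    # could be dropped in here without changing the crossed design.
    direction = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0)
    return X_te @ direction

# Fully crossed grid: every train size paired with every test size.
for n_tr in (500, 2_000, 8_000):
    for n_te in (250, 1_000, 4_000):
        aucs = []
        for _ in range(5):  # repeats reveal the noise at each combination
            idx = rng.permutation(N)
            tr, te = idx[:n_tr], idx[n_tr:n_tr + n_te]
            aucs.append(auc(fit_predict(X[tr], y[tr], X[te]), y[te]))
        print(f"train={n_tr:5d} test={n_te:5d} "
              f"AUC={np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

On data like this, the standard deviation across repeats shrinks as the test size grows, which is the kind of pattern the study quantifies on real speech corpora.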
Related papers
- Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education [0.5825410941577593]
The personality dimension of openness to experience can be predicted from an individual's Google search history.
Individual text corpora (ICs) were generated from 214 participants, with a mean of 5 million word tokens each.
arXiv Detail & Related papers (2024-03-29T21:44:24Z) - PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications [9.782175445247127]
PETA trained language models with 14 different vocabulary sizes under three tokenization methods.
It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities.
Experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance.
arXiv Detail & Related papers (2023-10-26T14:20:44Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Scaling laws for language encoding models in fMRI [47.498241053872924]
We tested whether larger open-source models are better at predicting brain responses recorded using fMRI.
Similar logarithmic behavior was observed when scaling the size of the fMRI training set.
These results suggest that increasing scale in both models and data will yield highly effective models of language processing in the brain.
arXiv Detail & Related papers (2023-05-19T17:53:03Z) - Token-Level Fitting Issues of Seq2seq Models [15.81037035729968]
Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks.
We find that seq2seq models trained with early-stopping suffer from issues at the token level.
arXiv Detail & Related papers (2023-05-08T06:40:24Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization.
Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network (F-measure 34.02) by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Predicting speech intelligibility from EEG using a dilated convolutional network [17.56832530408592]
We present a deep-learning-based model incorporating dilated convolutions that can be used to predict speech intelligibility without subject-specific training.
Our method is the first to predict the speech reception threshold from EEG for unseen subjects, contributing to objective measures of speech intelligibility.
arXiv Detail & Related papers (2021-05-14T14:12:52Z) - Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.