HumBEL: A Human-in-the-Loop Approach for Evaluating Demographic Factors
of Language Models in Human-Machine Conversations
- URL: http://arxiv.org/abs/2305.14195v3
- Date: Mon, 5 Feb 2024 17:28:07 GMT
- Authors: Anthony Sicilia, Jennifer C. Gates, and Malihe Alikhani
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While demographic factors like age and gender change the way people talk, and
in particular, the way people talk to machines, there is little investigation
into how large pre-trained language models (LMs) can adapt to these changes. To
remedy this gap, we consider how demographic factors in LM language skills can
be measured to determine compatibility with a target demographic. We suggest
clinical techniques from Speech Language Pathology, which has norms for
acquisition of language skills in humans. We conduct evaluation with a domain
expert (i.e., a clinically licensed speech language pathologist), and also
propose automated techniques to complement clinical evaluation at scale.
Empirically, we focus on age, finding LM capability varies widely depending on
task: GPT-3.5 mimics the ability of humans ranging from ages 6 to 15 at tasks
requiring inference, yet outperforms a typical 21-year-old at
memorization. GPT-3.5 also has trouble with social language use, exhibiting
less than 50% of the tested pragmatic skills. Findings affirm the importance of
considering demographic alignment and conversational goals when using LMs as
public-facing tools. Code, data, and a package will be available.
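As a minimal sketch of the age-norm comparison described above: the idea is to map a model's raw score on a language-skill test to the human age band whose normative score range contains it. The norm table, skill names, and scores below are invented for illustration; real norms come from standardized SLP assessments, not from the paper's data.

```python
# Illustrative sketch: map a raw test score to the human age band whose
# clinical norm range covers it. AGE_NORMS is invented for illustration.
AGE_NORMS = {
    "inference": [((6, 8), (10, 19)), ((9, 11), (20, 29)), ((12, 15), (30, 39))],
    "memorization": [((6, 11), (10, 24)), ((12, 21), (25, 40))],
}

def age_equivalent(skill: str, raw_score: int):
    """Return the (min_age, max_age) band whose norm range covers raw_score."""
    for ages, (lo, hi) in AGE_NORMS[skill]:
        if lo <= raw_score <= hi:
            return ages
    return None  # score falls outside all tabulated norm ranges

print(age_equivalent("inference", 25))  # hypothetical score -> (9, 11)
```

The same lookup generalizes to any skill with tabulated human norms; the interesting empirical question is which band a given model lands in per task.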
Related papers
- Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Large Language Models Can Infer Psychological Dispositions of Social Media Users [1.0923877073891446]
We test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario.
Our results show an average correlation of r = .29 (range = [.22, .33]) between LLM-inferred and self-reported trait scores.
Predictions were more accurate for women and for younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or from differences in online self-expression.
arXiv Detail & Related papers (2023-09-13T01:27:48Z)
- Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition [48.29355616574199]
We analyze the transferability of emotion recognition across three languages: English, Mandarin Chinese, and Cantonese.
This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method.
arXiv Detail & Related papers (2023-06-26T08:48:08Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
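A common way such benchmarks probe the lexical level is a spot-the-word test: the model should assign a real word a higher score than a matched nonword. A minimal sketch follows; the scorer and word pairs are toy stand-ins (a real evaluation would use a model's log-probabilities), not the benchmark's actual data.

```python
# Hedged sketch of a lexical probe: a model "knows" a word if it scores it
# above a matched nonword. `score` stands in for a model's log-probability.
def lexical_accuracy(pairs, score):
    """Fraction of (word, nonword) pairs where the word outscores the nonword."""
    hits = sum(1 for word, nonword in pairs if score(word) > score(nonword))
    return hits / len(pairs)

# Toy stand-in scorer: counts vowels, as a crude proxy for wordlikeness.
toy_score = lambda s: sum(s.count(v) for v in "aeiou")

pairs = [("dog", "dlg"), ("apple", "appel"), ("house", "hxuse")]
print(lexical_accuracy(pairs, toy_score))  # ties count as misses -> 2/3
```

Swapping `toy_score` for a speech or text model's scoring function turns this into the kind of zero-training probe the abstract describes.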
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Assessing Language Disorders using Artificial Intelligence: a Paradigm Shift [0.13393465195776774]
Speech, language, and communication deficits are present in most neurodegenerative syndromes.
We argue that using machine learning methodologies, natural language processing, and modern artificial intelligence (AI) for language assessment is an improvement over conventional manual assessment.
arXiv Detail & Related papers (2023-05-31T17:20:45Z)
- Computational Language Acquisition with Theory of Mind [84.2267302901888]
We build language-learning agents equipped with Theory of Mind (ToM) and measure its effects on the learning process.
We find that training speakers with a highly weighted ToM listener component leads to performance gains in our image referential game setting.
arXiv Detail & Related papers (2023-03-02T18:59:46Z)
- Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers [34.768337465321395]
Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models.
We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers.
We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning.
arXiv Detail & Related papers (2022-10-13T21:16:27Z)
- Predicting Human Psychometric Properties Using Computational Language Models [5.806723407090421]
Transformer-based language models (LMs) continue to achieve state-of-the-art performance on natural language processing (NLP) benchmarks.
Can LMs be of use in predicting the psychometric properties of test items, when those items are given to human participants?
We gather responses from numerous human participants and LMs on a broad diagnostic test of linguistic competencies.
We then calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately.
arXiv Detail & Related papers (2022-05-12T16:40:12Z)
- A Multi-modal Machine Learning Approach and Toolkit to Automate Recognition of Early Stages of Dementia among British Sign Language Users [5.8720142291102135]
A timely diagnosis helps in obtaining necessary support and appropriate medication.
Deep learning based approaches for image and video analysis and understanding are promising.
We show that our approach is not over-fitted and has the potential to scale up.
arXiv Detail & Related papers (2020-10-01T16:35:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.