Exploring Language Patterns in a Medical Licensure Exam Item Bank
- URL: http://arxiv.org/abs/2111.10501v1
- Date: Sat, 20 Nov 2021 02:45:35 GMT
- Title: Exploring Language Patterns in a Medical Licensure Exam Item Bank
- Authors: Swati Padhee, Kimberly Swygert, Ian Micir
- Abstract summary: This study is the first attempt to use machine learning (ML) and NLP to explore language bias in a large item bank.
Using a prediction algorithm trained on clusters of similar item stems, we demonstrate that our approach can be used to review large item banks for potentially biased language.
- Score: 0.25782420501870296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study examines the use of natural language processing (NLP) models to
evaluate whether language patterns used by item writers in a medical licensure
exam might contain evidence of biased or stereotypical language. This type of
bias in item language choices can be particularly impactful for items in a
medical licensure assessment, as it could pose a threat to content validity and
defensibility of test score validity evidence. To the best of our knowledge,
this is the first attempt to use machine learning (ML) and NLP to explore
language bias in a large item bank. Using a prediction algorithm trained on
clusters of similar item stems, we demonstrate that our approach can be used to
review large item banks for potentially biased language or stereotypical patient
characteristics in clinical science vignettes. The findings may guide the
development of methods to address stereotypical language patterns found in test
items and enable an efficient updating of those items, if needed, to reflect
contemporary norms, thereby improving the evidence to support the validity of
the test scores.
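The abstract does not spell out the pipeline, but the general idea, clustering similar item stems and then probing those clusters for concentrations of demographic or stereotypical language, can be sketched as follows. This is a minimal illustration under assumed choices: the TF-IDF features, k-means clustering, the toy item_stems list, and the demographic_terms set are hypothetical stand-ins, not the authors' actual models or data.
```python
# Minimal sketch (assumed pipeline, not the authors' actual method):
# 1) vectorize item stems, 2) cluster similar stems, 3) probe each cluster
# for concentrations of demographic descriptors worth human review.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical inputs; a real item bank would hold thousands of stems.
item_stems = [
    "A 45-year-old woman presents with chest pain and shortness of breath.",
    "A 60-year-old man with a long smoking history reports a chronic cough.",
    "A 52-year-old woman presents with fatigue and joint pain.",
    "A 47-year-old man reports chest pain radiating to the left arm.",
]
demographic_terms = {"woman", "man", "boy", "girl"}  # illustrative term list only

# Steps 1-2: TF-IDF features and k-means clustering of similar stems.
vectors = TfidfVectorizer(stop_words="english").fit_transform(item_stems)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: count demographic descriptors per cluster; a cluster where one
# descriptor dominates is flagged for manual review of potential bias.
for cluster in sorted(set(labels)):
    counts = Counter(
        token
        for stem, label in zip(item_stems, labels)
        if label == cluster
        for token in stem.lower().split()
        if token in demographic_terms
    )
    print(f"cluster {cluster}: demographic term counts {dict(counts)}")
```
In a sketch like this, any cluster flagged by the counts would go to human item reviewers for judgment rather than being edited automatically, consistent with the abstract's framing of the approach as a review aid for large item banks.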
Related papers
- Textual Entailment for Effective Triple Validation in Object Prediction [4.94309218465563]
We propose to use textual entailment to validate facts extracted from language models through cloze statements.
Our results show that triple validation based on textual entailment improves language model predictions in different training regimes.
arXiv Detail & Related papers (2024-01-29T16:50:56Z)
- Language Generation from Brain Recordings [68.97414452707103]
We propose a generative language BCI that utilizes the capacity of a large language model and a semantic brain decoder.
The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli.
Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
arXiv Detail & Related papers (2023-11-16T13:37:21Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews [0.21485350418225244]
We compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews.
We employ the Amazon review dataset for our experiments.
arXiv Detail & Related papers (2023-02-19T18:22:59Z)
- Average Is Not Enough: Caveats of Multilingual Evaluation [0.0]
We argue that a qualitative analysis informed by comparative linguistics is needed for multilingual results to detect this kind of bias.
We show in our case study that results in published works can indeed be linguistically biased, and we demonstrate that visualization based on the URIEL typological database can detect it.
arXiv Detail & Related papers (2023-01-03T18:23:42Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results with 11 current pre-trained Chinese models, and the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Do language models learn typicality judgments from text? [6.252236971703546]
We evaluate predictive language models (LMs) on a prevalent phenomenon in cognitive science: typicality.
Our first test targets whether typicality modulates LMs in assigning taxonomic category memberships to items.
The second test investigates sensitivities to typicality in LMs' probabilities when extending new information about items to their categories.
arXiv Detail & Related papers (2021-05-06T21:56:40Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings [16.136832979324467]
We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset.
We identify dangerous latent relationships that are captured by the contextual word embeddings.
We evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks.
arXiv Detail & Related papers (2020-03-11T23:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.