Using Large Language Models for Qualitative Analysis can Introduce
Serious Bias
- URL: http://arxiv.org/abs/2309.17147v2
- Date: Thu, 5 Oct 2023 12:25:18 GMT
- Title: Using Large Language Models for Qualitative Analysis can Introduce
Serious Bias
- Authors: Julian Ashwin, Aditya Chhabra and Vijayendra Rao
- Abstract summary: Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood.
This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh.
We find that a great deal of caution is needed in using LLMs to annotate text as there is a risk of introducing biases that can lead to misleading inferences.
- Score: 0.09208007322096534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are quickly becoming ubiquitous, but the
implications for social science research are not yet well understood. This
paper asks whether LLMs can help us analyse large-N qualitative data from
open-ended interviews, with an application to transcripts of interviews with
Rohingya refugees in Cox's Bazaar, Bangladesh. We find that a great deal of
caution is needed in using LLMs to annotate text as there is a risk of
introducing biases that can lead to misleading inferences. Here we mean bias in
the technical sense: the errors that LLMs make in annotating interview
transcripts are not random with respect to the characteristics of the interview
subjects. Training simpler supervised models on high-quality human annotations
with flexible coding leads to less measurement error and bias than LLM
annotations. Therefore, given that some high-quality annotations are necessary
in order to assess whether an LLM introduces bias, we argue that it is probably
preferable to train a bespoke model on these annotations than it is to use an
LLM for annotation.
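The bias check described above can be made concrete with a short sketch (not the authors' code): compare LLM labels against high-quality human labels on the same transcripts and test whether the resulting errors are correlated with a respondent characteristic. The column names, the single binary characteristic, and the synthetic data below are illustrative assumptions.

```python
# Illustrative sketch only, not the paper's implementation.
# Assumed columns: "llm_label", "human_label", and a binary
# respondent characteristic such as "female".
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def annotation_error_bias(df: pd.DataFrame, group_col: str) -> dict:
    """Test whether LLM annotation errors are non-random with respect
    to a respondent characteristic (bias in the technical sense)."""
    errors = (df["llm_label"] != df["human_label"]).astype(int)
    table = pd.crosstab(df[group_col], errors)  # groups x (correct, error)
    chi2, p_value, _, _ = chi2_contingency(table)
    error_rates = errors.groupby(df[group_col]).mean()
    return {"error_rate_by_group": error_rates.to_dict(),
            "chi2": chi2, "p_value": p_value}

# Tiny synthetic example in which the LLM mislabels one group more often.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"female": rng.integers(0, 2, n),
                   "human_label": rng.integers(0, 2, n)})
flip = rng.random(n) < np.where(df["female"] == 1, 0.30, 0.10)
df["llm_label"] = np.where(flip, 1 - df["human_label"], df["human_label"])

print(annotation_error_bias(df, "female"))
```

If the errors were random with respect to respondent characteristics they would mainly add noise to downstream estimates; errors that are correlated with those characteristics can systematically distort comparisons across groups, which is the sense of bias the paper warns about and why it prefers a bespoke supervised model trained on the human annotations.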
Related papers
- LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? [18.663118865354427]
Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms.
We propose LLM-Assisted Relevance Assessments (LARA) to balance manual annotations with LLM annotations.
arXiv Detail & Related papers (2024-11-11T11:17:35Z) - What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z) - Can Unconfident LLM Annotations Be Used for Confident Conclusions? [34.23823544208315]
Large language models (LLMs) have shown high agreement with human raters across a variety of tasks.
We introduce Confidence-Driven Inference: a method that combines LLM annotations with LLM confidence indicators to strategically select which human annotations should be collected.
arXiv Detail & Related papers (2024-08-27T17:03:18Z) - A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses.
This paper introduces a dataset for the safety evaluation of Chinese LLMs.
We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z) - Learning to Generate Explainable Stock Predictions using Self-Reflective
Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large
Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating annotation work, with up to a 21% performance improvement over a random baseline across different datasets; a minimal sketch of this uncertainty-guided allocation idea appears after the list below.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in
LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal a language model's comprehensive grasp of language, reflected in its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Validating Large Language Models with ReLM [11.552979853457117]
Large language models (LLMs) have been touted for their ability to generate natural-sounding text.
There are growing concerns around possible negative effects of LLMs such as data memorization, bias, and inappropriate language.
We introduce ReLM, a system for validating and querying LLMs using standard regular expressions.
arXiv Detail & Related papers (2022-11-21T21:40:35Z)
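Several of the entries above (LARA, Confidence-Driven Inference, CoAnnotating) share the same underlying allocation idea: keep LLM labels where the model is confident and route uncertain items to human annotators. The sketch below is a minimal, hypothetical illustration of that routing rule, not the algorithm of any of these papers; the record layout and the 0.8 threshold are assumptions.

```python
# Hypothetical uncertainty-guided allocation between an LLM and human
# annotators; not the CoAnnotating / LARA / Confidence-Driven Inference code.
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    llm_label: str
    llm_confidence: float  # e.g. a calibrated probability in [0, 1]

def allocate(items: list[Item], threshold: float = 0.8):
    """Keep LLM labels when confidence clears the threshold; send the
    remaining items to human annotators."""
    keep_llm = [it for it in items if it.llm_confidence >= threshold]
    to_human = [it for it in items if it.llm_confidence < threshold]
    return keep_llm, to_human

items = [Item("lost access to healthcare", "health", 0.95),
         Item("ambiguous short answer", "other", 0.55)]
auto, manual = allocate(items)
print(f"{len(auto)} LLM labels kept, {len(manual)} items sent to humans")
```

The cited papers go further, for example by strategically selecting which human annotations to collect rather than applying a fixed threshold; this sketch does not attempt that.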
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.