The Risks of Using Large Language Models for Text Annotation in Social Science Research
- URL: http://arxiv.org/abs/2503.22040v1
- Date: Thu, 27 Mar 2025 23:33:36 GMT
- Title: The Risks of Using Large Language Models for Text Annotation in Social Science Research
- Authors: Hao Lin, Yongjun Zhang
- Abstract summary: We conduct a systematic evaluation of the promises and risks of using large language models (LLMs) in coding tasks. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant.
- Score: 3.276333240221372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative artificial intelligence (GenAI) or large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks of using LLMs for diverse coding tasks, with social movement studies serving as a case example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework provides tools for researchers to develop the optimal prompt, and to examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we discuss the associated epistemic risks related to validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks, and how we can better communicate the epistemic risks in research.
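The framework's emphasis on examining and reporting validity and reliability can be made concrete with a small amount of code. The sketch below is illustrative only, not the authors' implementation: `annotate_with_llm` is a hypothetical placeholder for whatever LLM client a researcher uses, and the check shown is the standard one of comparing LLM-assigned codes against human-coded gold labels with a chance-corrected agreement statistic.

```python
# Minimal sketch, assuming scikit-learn and a hypothetical LLM client.
from sklearn.metrics import cohen_kappa_score

def annotate_with_llm(text: str) -> str:
    """Hypothetical placeholder: send `text` plus the coding prompt to an
    LLM and return a single category label, e.g. 'protest' / 'not_protest'."""
    raise NotImplementedError("wire up your LLM API client here")

documents = ["Thousands marched downtown demanding wage reform."]  # texts to code
human_labels = ["protest"]  # gold labels from trained human coders

llm_labels = [annotate_with_llm(doc) for doc in documents]

# Validity: agreement between LLM and human coders, corrected for chance.
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"LLM-human agreement (Cohen's kappa): {kappa:.2f}")

# Reliability: repeat the same prompt over several runs (temperature > 0)
# and compute pairwise kappa between runs before trusting the labels.
```

Reporting kappa rather than raw percent agreement matters here because coding categories are often imbalanced, and raw agreement can look high even when the LLM adds little beyond chance.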
Related papers
- An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI). This paper explores potential areas where statisticians can make important contributions to the development of LLMs. We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z) - Large Language Model for Qualitative Research -- A Systematic Mapping Study [3.302912592091359]
Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools.
This study systematically maps the literature on the use of LLMs for qualitative research.
Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes.
arXiv Detail & Related papers (2024-11-18T21:28:00Z) - Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Occasionally Secure: A Comparative Analysis of Code Generation Assistants [8.573156248244695]
This paper focuses on identifying and understanding the conditions and contexts in which LLMs can be effectively and safely deployed.
We conducted a comparative analysis of four advanced LLMs (GPT-3.5 and GPT-4, accessed via ChatGPT, and Google's Bard and Gemini) on 9 separate tasks to assess each model's code generation capabilities.
We collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability.
arXiv Detail & Related papers (2024-02-01T15:49:47Z) - An Empirical Study on Usage and Perceptions of LLMs in a Software Engineering Project [1.433758865948252]
Large Language Models (LLMs) represent a leap in artificial intelligence, excelling in tasks using human language(s).
In this paper, we analyze the AI-generated code, prompts used for code generation, and the human intervention levels to integrate the code into the code base.
Our findings suggest that LLMs can play a crucial role in the early stages of software development.
arXiv Detail & Related papers (2024-01-29T14:32:32Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, each requiring multiple rounds of language interaction, spanning open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Large Language Models for Code Analysis: Do LLMs Really Do Their Job? [13.48555476110316]
Large language models (LLMs) have demonstrated significant potential in the realm of natural language understanding and programming code processing tasks.
This paper offers a comprehensive evaluation of LLMs' capabilities in performing code analysis tasks.
arXiv Detail & Related papers (2023-10-18T22:02:43Z) - Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach [12.214585409361126]
Large language model (LLM)-based code generation relies on complex and powerful black-box models.
We propose a novel causal graph-based representation of the prompt and the generated code.
We illustrate the insights that our framework can provide by studying 3 popular LLMs with 12 prompt adjustment strategies.
arXiv Detail & Related papers (2023-10-10T14:56:26Z) - Machine-assisted quantitizing designs: augmenting humanities and social sciences with artificial intelligence [0.0]
Large language models (LLMs) have been shown to present an unprecedented opportunity to scale up data analytics in the humanities and social sciences.
We build on quantitizing and converting design principles from mixed methods research, and on feature analysis from linguistics, to transparently integrate human expertise with machine scalability.
The approach is discussed and demonstrated in over a dozen LLM-assisted case studies, covering 9 diverse languages, multiple disciplines and tasks.
arXiv Detail & Related papers (2023-09-24T14:21:50Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context (a minimal sketch of this strategy appears after this list); 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and
Toxicity [19.94836502156002]
Large language models (LLMs) may exhibit social prejudice and toxicity, posing ethical and societal risks when deployed irresponsibly.
We empirically benchmark ChatGPT on multiple sample datasets.
We find that a significant number of ethical risks cannot be addressed by existing benchmarks.
arXiv Detail & Related papers (2023-01-30T13:20:48Z)
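To make the first attack strategy from the Red Teaming Language Model Detectors entry above concrete, here is a minimal, hypothetical sketch of synonym replacement, assuming NLTK's WordNet. It is not the paper's code, which additionally conditions replacements on the surrounding context.

```python
# Minimal sketch of attack strategy 1 (synonym replacement), assuming NLTK.
# Requires: pip install nltk, then nltk.download('wordnet')
import random
from nltk.corpus import wordnet

def synonym_swap(text: str, rate: float = 0.2) -> str:
    """Randomly replace a fraction of words with WordNet synonyms to
    perturb LLM-generated text before feeding it to a detector."""
    words = text.split()
    for i, word in enumerate(words):
        if random.random() < rate:
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wordnet.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            if synonyms:
                words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_swap("The model generates fluent and coherent text."))
```

A detector that flips its decision on such lightly perturbed text is brittle; that fragility is the failure mode this line of red-teaming work probes.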
This list is automatically generated from the titles and abstracts of the papers on this site.