From Voices to Validity: Leveraging Large Language Models (LLMs) for
Textual Analysis of Policy Stakeholder Interviews
- URL: http://arxiv.org/abs/2312.01202v1
- Date: Sat, 2 Dec 2023 18:55:14 GMT
- Title: From Voices to Validity: Leveraging Large Language Models (LLMs) for
Textual Analysis of Policy Stakeholder Interviews
- Authors: Alex Liu and Min Sun
- Abstract summary: This study explores the integration of Large Language Models (LLMs) with human expertise to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state.
Using a mixed-methods approach, human experts developed a codebook and coding processes informed by domain knowledge and unsupervised topic modeling results.
Results reveal that while GPT-4 thematic coding aligned with human coding by 77.89% at specific themes, expanding to broader themes increased congruence to 96.02%, surpassing traditional Natural Language Processing (NLP) methods by over 25%.
- Score: 14.135107583299277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Obtaining stakeholders' diverse experiences and opinions about current policy
in a timely manner is crucial for policymakers to identify strengths and gaps
in resource allocation, thereby supporting effective policy design and
implementation. However, manually coding even moderately sized interview texts
or open-ended survey responses from stakeholders can often be labor-intensive
and time-consuming. This study explores the integration of Large Language
Models (LLMs)--like GPT-4--with human expertise to enhance text analysis of
stakeholder interviews regarding K-12 education policy within one U.S. state.
Employing a mixed-methods approach, human experts developed a codebook and
coding processes informed by domain knowledge and unsupervised topic modeling
results. They then designed prompts to guide GPT-4's analysis and iteratively
evaluated the performance of different prompts. This combined
human-computer method enabled nuanced thematic and sentiment analysis. Results
reveal that while GPT-4 thematic coding aligned with human coding by 77.89% at
specific themes, expanding to broader themes increased congruence to 96.02%,
surpassing traditional Natural Language Processing (NLP) methods by over 25%.
Additionally, GPT-4's sentiment ratings matched expert sentiment analysis more
closely than lexicon-based methods did. Findings from quantitative measures and qualitative
reviews underscore the complementary roles of human domain expertise and
automated analysis as LLMs offer new perspectives and coding consistency. The
human-computer interactive approach enhances efficiency, validity, and
interpretability of educational policy research.
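The paper does not reproduce its prompts or codebook here, so the following is only a minimal sketch of the kind of human-computer workflow the abstract describes: GPT-4 assigns a codebook theme to each interview excerpt via the OpenAI chat API, and congruence with human coding is computed as simple percent agreement. The theme list, prompt wording, and helper names are hypothetical.

```python
# Hypothetical sketch of LLM-assisted thematic coding; the codebook themes,
# prompt wording, and helper names are illustrative, not the authors' materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODEBOOK = ["teacher workforce", "student mental health", "funding", "other"]  # assumed themes

def code_excerpt(excerpt: str, model: str = "gpt-4") -> str:
    """Ask the model to label one interview excerpt with a single codebook theme."""
    prompt = (
        "You are coding stakeholder interviews about K-12 education policy.\n"
        f"Codebook themes: {', '.join(CODEBOOK)}.\n"
        "Return only the single best-matching theme for the excerpt below.\n\n"
        f"Excerpt: {excerpt}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def percent_agreement(llm_codes: list[str], human_codes: list[str]) -> float:
    """Congruence as the share of excerpts where LLM and human codes match."""
    matches = sum(m == h for m, h in zip(llm_codes, human_codes))
    return 100.0 * matches / len(human_codes)
```

In a setup like this, the reported gain when moving from specific to broader themes corresponds to mapping both code lists through a specific-to-broad theme dictionary before computing percent agreement.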
Related papers
- Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation [0.0]
This study presents a framework for the automated evaluation of dynamically evolving topics in scientific literature using Large Language Models (LLMs).
The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics.
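A minimal sketch of that idea, assuming an OpenAI-style chat API and a 1-5 rating scale; the prompt wording and scale are not taken from the paper.

```python
# Hypothetical sketch: an LLM rates a topic's top words on one quality dimension
# (coherence, repetitiveness, diversity, or topic-document alignment).
from openai import OpenAI

client = OpenAI()

def rate_topic(top_words: list[str], dimension: str = "coherence") -> int:
    """Ask the model for a 1-5 rating of one topic along one dimension."""
    prompt = (
        f"Rate the following topic on {dimension} from 1 (poor) to 5 (excellent). "
        "Answer with a single digit.\n"
        f"Top words: {', '.join(top_words)}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(reply.choices[0].message.content.strip()[0])  # assumes a digit reply
```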
arXiv Detail & Related papers (2025-02-11T08:23:56Z) - Assessing Personalized AI Mentoring with Large Language Models in the Computing Field [3.855858854481047]
GPT-4, LLaMA 3, and PaLM 2 were evaluated using a zero-shot learning approach without human intervention.
The analysis of frequently used words in the responses indicates that GPT-4 offers more personalized mentoring.
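A rough sketch of that kind of frequent-word analysis; the tokenization and stopword list are simplifications, not the paper's procedure.

```python
# Minimal sketch of frequent-word analysis over a model's mentoring responses.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "for", "is", "you", "your"}

def frequent_words(responses: list[str], top_k: int = 20) -> list[tuple[str, int]]:
    """Count content words across responses and return the top_k most common."""
    tokens: list[str] = []
    for text in responses:
        tokens += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(tokens).most_common(top_k)
```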
arXiv Detail & Related papers (2024-12-11T14:51:13Z) - Harnessing AI for efficient analysis of complex policy documents: a case study of Executive Order 14110 [44.99833362998488]
Policy documents, such as legislation, regulations, and executive orders, are crucial in shaping society.
This study aims to evaluate the potential of AI in streamlining policy analysis and to identify the strengths and limitations of current AI approaches.
arXiv Detail & Related papers (2024-06-10T11:19:28Z) - QuaLLM: An LLM-based Framework to Extract Quantitative Insights from Online Forums [10.684484559041284]
This study introduces QuaLLM, a novel framework to analyze and extract quantitative insights from text data on online forums.
We applied this framework to analyze over one million comments from two of Reddit's rideshare worker communities.
We uncover significant worker concerns regarding AI and algorithmic platform decisions, responding to regulatory calls about worker insights.
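A minimal sketch of turning free-text comments into quantitative insights, assuming an OpenAI-style chat API; the concern categories and prompt wording are hypothetical, not QuaLLM's.

```python
# Hypothetical sketch: an LLM labels each comment with one concern category,
# and the labels are tallied into percentages. Categories are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()
CONCERNS = ["algorithmic pay", "deactivation", "AI surveillance", "other"]

def label_comment(comment: str) -> str:
    """Ask the model to assign one concern category to a single comment."""
    prompt = (
        f"Label this rideshare-worker comment with one concern from {CONCERNS}. "
        f"Answer with the label only.\n\nComment: {comment}"
    )
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return reply.choices[0].message.content.strip()

def concern_shares(comments: list[str]) -> dict[str, float]:
    """Percentage of comments assigned to each concern category."""
    counts = Counter(label_comment(c) for c in comments)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}
```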
arXiv Detail & Related papers (2024-05-08T18:20:03Z) - Evaluating Large Language Models in Analysing Classroom Dialogue [8.793491910415897]
The study involves datasets from a middle school, encompassing classroom dialogues across mathematics and Chinese classes.
These dialogues were manually coded by educational experts and then analyzed using a customised GPT-4 model.
Results indicate substantial time savings with GPT-4, and a high degree of consistency in coding between the model and human coders, with some discrepancies in specific codes.
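One common way to quantify that model-human coding consistency is a chance-corrected agreement statistic such as Cohen's kappa; the sketch below is a generic implementation, not necessarily the metric used in the paper.

```python
# Cohen's kappa between two coders (e.g., GPT-4 and a human) over the same
# utterances; a generic sketch, not necessarily the paper's exact metric.
from collections import Counter

def cohen_kappa(codes_a: list[str], codes_b: list[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1
```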
arXiv Detail & Related papers (2024-02-04T07:39:06Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
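A minimal sketch of the proxy-question idea, assuming an OpenAI-style chat API: an evaluator model answers each proxy-question using only the generated text, and accuracy against the pre-annotated answers scores the generation. The prompt wording and matching rule are assumptions.

```python
# Hypothetical sketch of proxy-question evaluation for long-form generation.
from openai import OpenAI

client = OpenAI()

def answer_from_text(generated_text: str, proxy_question: str) -> str:
    """Evaluator answers the question using only the generated document."""
    prompt = (
        "Answer the question using only the document below. Be concise.\n\n"
        f"Document:\n{generated_text}\n\nQuestion: {proxy_question}"
    )
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return reply.choices[0].message.content.strip().lower()

def proxy_accuracy(generated_text: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Share of proxy-questions whose annotated answer appears in the evaluator's reply."""
    hits = sum(gold.lower() in answer_from_text(generated_text, q) for q, gold in qa_pairs)
    return hits / len(qa_pairs)
```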
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - Little Giants: Exploring the Potential of Small LLMs as Evaluation
Metrics in Summarization in the Eval4NLP 2023 Shared Task [53.163534619649866]
This paper focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation.
We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting.
Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.
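Illustrative prompt templates in the spirit of the prompting techniques named above (standard prompting versus chain-of-thought); the wording is a guess, not the paper's prompts.

```python
# Illustrative quality-estimation prompt templates; wording is hypothetical.
STANDARD = (
    "Rate the quality of the summary below from 1 to 5.\n"
    "Source: {source}\nSummary: {summary}\nScore:"
)
CHAIN_OF_THOUGHT = (
    "Rate the quality of the summary below from 1 to 5.\n"
    "Source: {source}\nSummary: {summary}\n"
    "First explain step by step which facts are covered, missing, or wrong, "
    "then give the final score on its own line as 'Score: X'."
)

def build_prompt(template: str, source: str, summary: str) -> str:
    """Fill a template with the source text and candidate summary."""
    return template.format(source=source, summary=summary)
```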
arXiv Detail & Related papers (2023-11-01T17:44:35Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
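A minimal sketch of a multi-agent debate evaluator in that spirit, assuming an OpenAI-style chat API; the personas, rounds, and final-verdict step are assumptions, not ChatEval's actual design.

```python
# Hypothetical multi-agent debate: persona agents discuss two candidate answers
# over a few rounds, then a final call produces a verdict.
from openai import OpenAI

client = OpenAI()
PERSONAS = ["a strict fact-checker", "a writing-style critic", "a general reader"]

def chat(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    transcript = ""
    for _ in range(rounds):
        for persona in PERSONAS:
            turn = chat(
                f"You are {persona} judging two answers to: {question}\n"
                f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
                f"Discussion so far:\n{transcript}\n"
                "Add one short argument about which answer is better."
            )
            transcript += f"{persona}: {turn}\n"
    return chat(transcript + "\nGive the final verdict: answer only 'A' or 'B'.").strip()
```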
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Multi-Dimensional Evaluation of Text Summarization with In-Context
Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
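A sketch of how such an in-context evaluation prompt might be assembled for one quality dimension; the demonstration format and wording are illustrative, not the paper's setup.

```python
# Assemble a few-shot (in-context) evaluation prompt for one summary-quality
# dimension; the demonstration format and wording are assumptions.
def icl_eval_prompt(
    examples: list[tuple[str, str, int]],  # (source, summary, score) demonstrations
    source: str,
    summary: str,
    dimension: str = "coherence",
) -> str:
    shots = "\n\n".join(
        f"Source: {s}\nSummary: {m}\n{dimension.capitalize()} score (1-5): {score}"
        for s, m, score in examples
    )
    return (
        f"{shots}\n\nSource: {source}\nSummary: {summary}\n"
        f"{dimension.capitalize()} score (1-5):"
    )
```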
arXiv Detail & Related papers (2023-06-01T23:27:49Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
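A generic sketch of the underlying robustness check (apply a text transformation, then see how often a classifier's prediction flips); this is not TextFlint's actual API.

```python
# Generic robustness check, not TextFlint's API: perturb inputs with a simple
# typo-style transformation and measure how often predictions change.
import random

def swap_adjacent_chars(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent alphabetic characters as a typo-style perturbation."""
    random.seed(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def prediction_flip_rate(predict, texts: list[str]) -> float:
    """Percentage of examples whose predicted label changes under the perturbation.

    `predict` is any user-supplied callable mapping a string to a label.
    """
    flips = sum(predict(t) != predict(swap_adjacent_chars(t)) for t in texts)
    return 100.0 * flips / len(texts)
```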
arXiv Detail & Related papers (2021-03-21T17:20:38Z)