Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research
- URL: http://arxiv.org/abs/2510.18456v1
- Date: Tue, 21 Oct 2025 09:29:18 GMT
- Title: Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research
- Authors: Cristina Martinez Montes, Robert Feldt, Cristina Miguel Martos, Sofia Ouhbi, Shweta Premanandan, Daniel Graziotin
- Abstract summary: Large language models (LLMs) are entering qualitative research, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA). We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA. We conducted blind evaluations with four expert evaluators who applied rubrics derived from Braun and Clarke's quality criteria.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As artificial intelligence advances, large language models (LLMs) are entering qualitative research workflows, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA), one of the most common qualitative methods in software engineering research. Moreover, existing studies lack systematic evaluation of LLM-generated qualitative outputs against established quality criteria. We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA, then tested outputs from multiple LLMs against codes and themes produced by experienced researchers. Using 15 interviews on software engineers' well-being, we conducted blind evaluations with four expert evaluators who applied rubrics derived directly from Braun and Clarke's quality criteria. Evaluators preferred LLM-generated codes 61% of the time, finding them analytically useful for answering the research question. However, evaluators also identified limitations: LLMs fragmented data unnecessarily, missed latent interpretations, and sometimes produced themes with unclear boundaries. Our contributions are threefold. First, a reproducible approach integrating refined, documented prompts with an evaluation framework to operationalize Braun and Clarke's reflexive TA. Second, an empirical comparison of LLM- and human-generated codes and themes in software engineering data. Third, guidelines for integrating LLMs into qualitative analysis while preserving methodological rigour, clarifying when and how LLMs can assist effectively and when human interpretation remains essential.
Related papers
- Software Testing with Large Language Models: An Interview Study with Practitioners [2.198430261120653]
The use of large language models in software testing is growing fast as they support numerous tasks. However, their adoption often relies on informal experimentation rather than structured guidance. This study investigates how software testing professionals use LLMs in practice to propose a preliminary, practitioner-informed guideline.
arXiv Detail & Related papers (2025-10-20T05:06:56Z)
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z)
- Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
- Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
- AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science [8.281093505963158]
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. We present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the reproducibility of LLM-generated data analysis.
arXiv Detail & Related papers (2025-02-23T01:15:50Z)
- Applications and Implications of Large Language Models in Qualitative Analysis: A New Frontier for Empirical Software Engineering [0.46426852157920906]
The study emphasizes the need for structured strategies and guidelines to optimize LLM use in qualitative research within software engineering. While LLMs show promise in supporting qualitative analysis, human expertise remains crucial for interpreting data, and ongoing exploration of best practices will be vital for their successful integration into empirical software engineering research.
arXiv Detail & Related papers (2024-12-09T15:17:36Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Large Language Model for Qualitative Research -- A Systematic Mapping Study [3.302912592091359]
Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools. This study systematically maps the literature on the use of LLMs for qualitative research. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes.
arXiv Detail & Related papers (2024-11-18T21:28:00Z)
- Reconciling Methodological Paradigms: Employing Large Language Models as Novice Qualitative Research Assistants in Talent Management Research [1.0949553365997655]
This study proposes a novel approach by leveraging Retrieval Augmented Generation (RAG) based Large Language Models (LLMs) for analyzing interview transcripts.
The novelty of this work lies in strategizing the research inquiry as one that is augmented by an LLM that serves as a novice research assistant.
Our findings demonstrate that the LLM-augmented RAG approach can successfully extract topics of interest, with significant coverage compared to manually generated topics.
arXiv Detail & Related papers (2024-08-20T17:49:51Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)