Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant
- URL: http://arxiv.org/abs/2512.00009v1
- Date: Tue, 14 Oct 2025 21:17:34 GMT
- Title: Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant
- Authors: Joseph Matveyenko, James Liu, John David Parsons, Prateek Puri
- Abstract summary: We benchmark Muse, an interactive, AI-powered qualitative research system. We find an inter-rater reliability between Muse and humans of Cohen's $κ$ = 0.71 for well-specified codes. We also conduct robust error analysis to identify failure modes, guide future improvements, and demonstrate the capacity to correct for human bias.
- Score: 1.170789976854236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Qualitative research emphasizes constructing meaning through iterative engagement with textual data. Traditionally, this human-driven process requires navigating coder fatigue and interpretative drift, posing challenges when scaling analysis to larger, more complex datasets. Computational approaches to augmenting qualitative research have been met with skepticism, partly due to their inability to replicate the nuance, context-awareness, and sophistication of human analysis. Large language models, however, present new opportunities to automate aspects of qualitative analysis while upholding rigor and research quality in important ways. To assess their benefits and limitations - and build trust among qualitative researchers - these approaches must be rigorously benchmarked against human-generated datasets. In this work, we benchmark Muse, an interactive, AI-powered qualitative research system that allows researchers to identify themes and annotate datasets, finding an inter-rater reliability between Muse and humans of Cohen's $κ$ = 0.71 for well-specified codes. We also conduct robust error analysis to identify failure modes, guide future improvements, and demonstrate the capacity to correct for human bias.
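To ground the headline statistic, here is a minimal, self-contained sketch of computing Cohen's kappa for two raters. The formula kappa = (p_o - p_e) / (1 - p_e) is standard; the annotation arrays below are hypothetical stand-ins, not the paper's data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary code assignments (1 = code applies to the excerpt)
# from a human coder and from an AI assistant on the same ten excerpts.
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
ai    = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(f"kappa = {cohens_kappa(human, ai):.2f}")  # ~0.58 on this toy data
```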
Related papers
- Designing Computational Tools for Exploring Causal Relationships in Qualitative Data [29.086788542710313]
We designed and implemented QualCausal, a system that extracts and illustrates causal relationships through interactive causal network construction and visualization. A feedback study revealed that participants valued the system for reducing analytical burden and providing cognitive scaffolding. We discuss broader implications for designing computational tools that support qualitative data analysis; a small causal-network sketch follows this entry.
arXiv Detail & Related papers (2026-02-06T08:56:55Z)
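As a rough illustration of the causal-network idea (the codes and edges below are invented, and this is not QualCausal's actual implementation), a directed graph makes the downstream effects of a coded cause queryable:

```python
import networkx as nx

# Invented (cause, effect) code pairs of the kind a causal-extraction
# tool might surface from interview data.
edges = [
    ("remote work", "blurred boundaries"),
    ("blurred boundaries", "burnout"),
    ("async communication", "fewer interruptions"),
    ("fewer interruptions", "deep work"),
]

graph = nx.DiGraph(edges)

# Trace all downstream effects of a code, the kind of query an
# interactive causal-network view would support.
for cause in ("remote work", "async communication"):
    print(cause, "->", sorted(nx.descendants(graph, cause)))
```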
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research outputs, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- A Novel, Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data [8.695136686770772]
We argue that confidence in the credibility and robustness of results depends on adopting a 'human-in-the-loop' methodology. We propose a novel methodological framework for Computational Grounded Theory (CGT) that supports the analysis of large qualitative datasets.
arXiv Detail & Related papers (2025-06-06T13:43:12Z)
- Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction [69.38041171537573]
Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning offers transformative potential for large-scale water quality prediction and the generation of scientific insights, but its widespread adoption in high-stakes operational decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges.
arXiv Detail & Related papers (2025-03-13T01:50:50Z)
- MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? [51.85759493254735]
MindGYM is a structured and scalable framework for question synthesis. It infuses high-level reasoning objectives to shape the model's synthesis behavior, and it composes more complex multi-hop questions from QA seeds for deeper reasoning.
arXiv Detail & Related papers (2025-03-12T16:03:03Z)
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question. We highlight some key research problems in superalignment, namely weak-to-strong generalization, scalable oversight, and evaluation. We present a conceptual framework for superalignment consisting of three modules: an attacker, which generates adversarial queries that try to expose the weaknesses of a learner model; a learner, which refines itself by learning from scalable feedback generated by a critic model alongside minimal human expert input; and a critic, which generates critiques or explanations for a given query-response pair with the goal of improving the learner. A minimal sketch of this loop appears after this entry.
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
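The attacker-learner-critic loop can be sketched as plain control flow. Everything below is an assumption for illustration: the three components are string-returning stubs standing in for models, and the update rule is invented rather than the paper's specification.

```python
# Stub components; a real system would back each with a model.
def attacker(weak_spots: list) -> str:
    """Generate an adversarial query probing the latest known weakness."""
    return f"query targeting {weak_spots[-1]}"

def learner(query: str, internalized: list) -> str:
    """Answer the query using whatever feedback has been internalized."""
    return f"response to '{query}' ({len(internalized)} critiques absorbed)"

def critic(query: str, response: str) -> str:
    """Produce scalable feedback on a query-response pair."""
    return f"critique of: {response}"

weak_spots, internalized = ["long-horizon reasoning"], []
for step in range(3):
    query = attacker(weak_spots)
    response = learner(query, internalized)
    feedback = critic(query, response)
    internalized.append(feedback)          # learner refines itself
    weak_spots.append(f"weakness-{step}")  # attacker adapts to new gaps
    print(f"step {step}: {feedback}")
```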
- A Computational Method for Measuring "Open Codes" in Qualitative Analysis [44.39424825305388]
This paper presents a theory-informed computational method for measuring inductive coding results from humans and generative AI (GAI). It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence (simplified set-based stand-ins are sketched after this entry). Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
arXiv Detail & Related papers (2024-11-19T00:44:56Z)
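The snippet below gives simplified, set-based stand-ins for the four named metrics; the paper's actual definitions operate on merged coding results and are more involved, so treat these as illustrative only.

```python
def coverage(coder: set, merged: set) -> float:
    """Fraction of the merged codebook this coder produced."""
    return len(coder & merged) / len(merged)

def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two coders' code sets."""
    return len(a & b) / len(a | b)

def novelty(coder: set, others: set) -> float:
    """Fraction of this coder's codes that no other coder produced."""
    return len(coder - others) / len(coder)

def divergence(a: set, b: set) -> float:
    """Fraction of codes not shared between two coders (1 - Jaccard)."""
    return 1 - overlap(a, b)

# Invented open codes from a human coder and a generative-AI coder.
human = {"coping", "stigma", "access to care", "trust"}
gai = {"coping", "stigma", "cost barriers", "trust", "isolation"}
merged = human | gai

print(f"coverage(human) = {coverage(human, merged):.2f}")
print(f"overlap         = {overlap(human, gai):.2f}")
print(f"novelty(gai)    = {novelty(gai, human):.2f}")
print(f"divergence      = {divergence(human, gai):.2f}")
```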
- Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts [0.0]
A huge number of detectors and text collections with AI-generated fragments have emerged. However, the quality of such detectors tends to drop dramatically in the wild. We propose methods for evaluating the quality of datasets containing AI-generated fragments.
arXiv Detail & Related papers (2024-10-18T17:59:57Z)
- Challenges and Future Directions of Data-Centric AI Alignment [22.165745901158804]
Current alignment methods primarily focus on designing algorithms and loss functions but often underestimate the crucial role of data. This paper advocates for a shift towards data-centric AI alignment, emphasizing the need to enhance the quality and representativeness of data used in aligning AI systems.
arXiv Detail & Related papers (2024-10-02T19:03:42Z)
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of the data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- Interactive Multi-Objective Evolutionary Optimization of Software Architectures [0.0]
Putting the human in the loop brings new challenges to the search-based software engineering field.
This paper explores how interactive evolutionary computation can serve as a basis for integrating human judgment into the search process; a toy interactive loop is sketched after this entry.
arXiv Detail & Related papers (2024-01-08T19:15:40Z)
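A toy version of such a loop, with every detail invented for illustration: candidates are bit strings standing in for architectural decisions, and the human judgment that interactive evolutionary computation folds into the search is mocked by a stub that a real tool would replace with a UI prompt.

```python
import random

random.seed(0)
GENES = 12  # hypothetical yes/no design decisions per architecture

def automated_objective(candidate):
    """Stand-in for an automatically computable quality metric."""
    return sum(candidate) / GENES

def human_rating(candidate):
    """Mocked human judgment; a real tool would ask the architect."""
    return random.random()

def fitness(candidate):
    # Blend machine-measurable quality with human preference.
    return 0.7 * automated_objective(candidate) + 0.3 * human_rating(candidate)

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(20)]
for _ in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    children = []
    for _ in range(10):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENES)
        child = a[:cut] + b[cut:]
        child[random.randrange(GENES)] ^= 1  # point mutation
        children.append(child)
    population = parents + children

print("best:", max(population, key=automated_objective))
```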
- Can AI Serve as a Substitute for Human Subjects in Software Engineering Research? [24.39463126056733]
This vision paper proposes a novel approach to qualitative data collection in software engineering research by harnessing the capabilities of artificial intelligence (AI).
We explore the potential of AI-generated synthetic text as an alternative source of qualitative data.
We discuss the prospective development of new foundation models aimed at emulating human behavior in observational studies and user evaluations.
arXiv Detail & Related papers (2023-11-18T14:05:52Z)