Automated Quality Assessment for LLM-Based Complex Qualitative Coding: A Confidence-Diversity Framework
- URL: http://arxiv.org/abs/2508.20462v2
- Date: Tue, 30 Sep 2025 08:37:24 GMT
- Title: Automated Quality Assessment for LLM-Based Complex Qualitative Coding: A Confidence-Diversity Framework
- Authors: Zhilong Zhao, Yindi Liu
- Abstract summary: We develop a dual-signal quality assessment framework that combines model confidence with inter-model consensus (external entropy). We evaluate it across legal reasoning, political analysis, and medical classification transcripts. The framework offers a principled, domain-agnostic quality assurance mechanism that scales qualitative coding without extensive double-coding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computational social science lacks a scalable and reliable mechanism to assure quality for AI-assisted qualitative coding when tasks demand domain expertise and long-text reasoning, and traditional double-coding is prohibitively costly at scale. We develop and validate a dual-signal quality assessment framework that combines model confidence with inter-model consensus (external entropy) and evaluate it across legal reasoning (390 Supreme Court cases), political analysis (645 hyperpartisan articles), and medical classification (1,000 clinical transcripts). External entropy is consistently negatively associated with accuracy (r = -0.179 to -0.273, p < 0.001), while confidence is positively associated in two domains (r = 0.104 to 0.429). Weight optimization improves over single-signal baselines by 6.6-113.7% and transfers across domains (100% success), and an intelligent triage protocol reduces manual verification effort by 44.6% while maintaining quality. The framework offers a principled, domain-agnostic quality assurance mechanism that scales qualitative coding without extensive double-coding, provides actionable guidance for sampling and verification, and enables larger and more diverse corpora to be analyzed with maintained rigor.
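The abstract's dual-signal idea (model self-confidence combined with inter-model disagreement measured as entropy over the models' labels) can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' implementation: the function names, the linear weighting scheme, and the entropy normalisation are all assumptions.

```python
import math
from collections import Counter

def external_entropy(labels):
    """Shannon entropy (bits) of the label distribution across models.
    0.0 means all models agree; larger values mean more disagreement."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def quality_score(confidences, labels, w_conf=0.5, w_ent=0.5):
    """Weighted dual-signal quality score for one coded item.
    High mean confidence and low inter-model entropy both raise the
    score. The weights here are placeholders; the paper tunes them
    per domain via weight optimization."""
    mean_conf = sum(confidences) / len(confidences)
    max_ent = math.log2(len(labels))  # normalise entropy to [0, 1]
    ent = external_entropy(labels) / max_ent if max_ent > 0 else 0.0
    return w_conf * mean_conf + w_ent * (1.0 - ent)

# Example: three models code the same item, two agree on label "A"
score = quality_score([0.9, 0.8, 0.6], ["A", "A", "B"])
```

A triage protocol like the one described would then route items with low `quality_score` to manual verification and pass high-scoring items through automatically.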
Related papers
- A Multi-Dimensional Quality Scoring Framework for Decentralized LLM Inference with Proof of Quality [2.621929201001929]
We propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions. We show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration.
arXiv Detail & Related papers (2026-03-04T13:05:46Z)
- Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z)
- Hybrid-Code: A Privacy-Preserving, Redundant Multi-Agent Framework for Reliable Local Clinical Coding [0.0]
Clinical coding automation using cloud-based Large Language Models (LLMs) poses privacy risks and latency bottlenecks. We introduce Hybrid-Code, a hybrid neuro-symbolic multi-agent framework for local clinical coding that ensures production reliability through redundancy and verification.
arXiv Detail & Related papers (2025-12-26T02:27:36Z)
- AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z)
- Annotation-Efficient Universal Honesty Alignment [70.05453324928955]
Existing methods either rely on training-free confidence estimation or training-based calibration with correctness annotations. Elicitation-Then-Calibration (EliCal) is a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline.
arXiv Detail & Related papers (2025-10-20T13:05:22Z)
- AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning [2.918225266151982]
This paper introduces a novel framework designed to automatically introduce realistic, category-specific vulnerabilities into secure C/C++ code to generate datasets. The proposed approach coordinates multiple AI agents that simulate expert reasoning, along with function agents and traditional code analysis tools. Our experimental study on 116 code samples from three different benchmarks suggests that our approach outperforms other techniques with regard to dataset accuracy.
arXiv Detail & Related papers (2025-08-28T14:59:39Z)
- LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
- A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks [0.0]
Confidence-diversity calibration is a quality assessment framework for accessible coding tasks. Analysing 5,680 coding decisions from eight state-of-the-art LLMs, we find that mean self-confidence tracks inter-model agreement closely.
arXiv Detail & Related papers (2025-08-04T03:47:10Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [44.39462689088926]
CoQuIR is the first large-scale, multilingual benchmark designed to evaluate quality-aware code retrieval. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages.
arXiv Detail & Related papers (2025-05-31T13:00:17Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- A Holistic Assessment of the Reliability of Machine Learning Systems [30.638615396429536]
This paper proposes a holistic assessment methodology for the reliability of machine learning (ML) systems.
Our framework evaluates five key properties: in-distribution accuracy, distribution-shift robustness, adversarial robustness, calibration, and out-of-distribution detection.
To provide insights into the performance of different algorithmic approaches, we identify and categorize state-of-the-art techniques.
arXiv Detail & Related papers (2023-07-20T05:00:13Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.