Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
- URL: http://arxiv.org/abs/2512.04262v1
- Date: Wed, 03 Dec 2025 21:02:54 GMT
- Title: Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
- Authors: Nolan Platt, Ethan Luchs, Sehrish Nizamani,
- Abstract summary: This paper investigates whether large language models (LLMs) can provide reliable and consistent assessments at the development stage.<n>We generated over 850 evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o.<n>For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
Related papers
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level.<n>We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z) - Revisiting the Reliability of Language Models in Instruction-Following [15.281163913211818]
LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval.<n>We study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances.<n>Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior.
arXiv Detail & Related papers (2025-12-15T02:57:55Z) - MicroProbe: Efficient Reliability Assessment for Foundation Models with Minimal Data [0.0]
microprobe achieves comprehensive reliability assessment using only 100 strategically selected probe examples.<n>We demonstrate that microprobe achieves 23.5% higher composite reliability scores compared to random sampling baselines.<n> microprobe completes reliability assessment with 99.9% statistical power while representing a 90% reduction in assessment cost and maintaining 95% of traditional method coverage.
arXiv Detail & Related papers (2025-11-30T13:01:57Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks.<n>We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency.<n>We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - Automated Quality Assessment for LLM-Based Complex Qualitative Coding: A Confidence-Diversity Framework [0.23872611575805827]
We develop a dual-signal quality assessment framework that combines model confidence with inter-model consensus (external entropy)<n>We evaluate it across legal reasoning, political analysis, and medical classification transcripts.<n>The framework offers a principled, domain-agnostic quality assurance mechanism that scales qualitative coding without extensive double-coding.
arXiv Detail & Related papers (2025-08-28T06:25:07Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks [0.0]
Confidence-diversity calibration is a quality assessment framework for accessible coding tasks.<n>Analysing 5,680 coding decisions from eight state-of-the-art LLMs, we find that mean self-confidence tracks inter-model agreement closely.
arXiv Detail & Related papers (2025-08-04T03:47:10Z) - The Confidence Paradox: Can LLM Know When It's Wrong [5.445980143646736]
Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses.<n>We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning.
arXiv Detail & Related papers (2025-06-30T02:06:54Z) - VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation [0.8087612190556891]
VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub and annotated by security experts.<n>For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weaknession (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan.<n>Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER.<n>Our results show that current state-of-the-
arXiv Detail & Related papers (2025-05-26T01:20:44Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence
Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not provide access to these probabilities.
Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.