Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models
- URL: http://arxiv.org/abs/2504.13068v1
- Date: Thu, 17 Apr 2025 16:29:08 GMT
- Title: Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models
- Authors: Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma
- Abstract summary: This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants and the Universal Sentence Encoder (USE) -- against expert-labeled data and narrative text. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords.
- Score: 2.1797343876622097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
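As a rough illustration of the agreement metric discussed in the abstract, the sketch below computes accuracy and Cohen's Kappa between hypothetical model and expert labels; the class names and data are invented for the example and do not come from the paper.

```python
# Minimal sketch: raw accuracy vs. Cohen's Kappa on invented expert-labeled data.
# The class names ("work_zone" / "non_work_zone") are placeholders, not the paper's labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score

expert_labels = ["work_zone", "non_work_zone", "work_zone", "work_zone", "non_work_zone", "work_zone"]
model_labels  = ["work_zone", "non_work_zone", "work_zone", "non_work_zone", "non_work_zone", "work_zone"]

print("Accuracy:     ", accuracy_score(expert_labels, model_labels))
print("Cohen's Kappa:", cohen_kappa_score(expert_labels, model_labels))
```

Kappa discounts the agreement expected by chance, which is how a model can post high raw accuracy while agreeing with domain experts only modestly.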
Related papers
- Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning [5.4141465747474475]
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving problems of moderate complexity. We systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph.
arXiv Detail & Related papers (2025-02-19T20:20:24Z) - The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration [5.616884466478886]
Pre-trained language models (PLMs) have enabled significant performance gains in the field of natural language processing. Recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. This paper investigates whether lower calibration error implies reliable decision rules for a language model.
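For context, a common way to quantify the miscalibration mentioned here is the expected calibration error (ECE); the sketch below is a generic binned ECE implementation, not the estimator used in the paper.

```python
# Generic binned expected calibration error (ECE): weighted gap between
# mean confidence and empirical accuracy within each confidence bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy example: high confidence paired with frequent mistakes yields a large ECE.
print(expected_calibration_error([0.95, 0.9, 0.9, 0.85, 0.6], [1, 0, 1, 0, 1]))
```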
arXiv Detail & Related papers (2024-12-17T08:04:28Z) - A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs). We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
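The summary above describes a multi-step prompting strategy with global consistency checks; the sketch below shows one plausible shape for such a loop, with `call_llm` as a stub standing in for a real model API. Nothing here is the paper's implementation.

```python
# Hypothetical sketch of multi-step prompting with a global consistency check.
# `call_llm` is a stub; replace it with an actual model call to use the loop.
def call_llm(prompt: str) -> str:
    return "stub response"

def answer_with_consistency_check(question: str, n_steps: int = 3) -> str:
    # Elicit intermediate reasoning steps one at a time.
    steps = [call_llm(f"Step {i + 1} toward answering: {question}") for i in range(n_steps)]
    # Draft an answer conditioned on all intermediate steps.
    candidate = call_llm(f"Given the steps {steps}, answer: {question}")
    # Global consistency check: does the answer follow from the steps taken together?
    verdict = call_llm(f"Do the steps {steps} jointly support the answer '{candidate}'? Reply yes or no.")
    if verdict.strip().lower().startswith("no"):
        candidate = call_llm(f"Revise the answer to '{question}' so it is consistent with {steps}.")
    return candidate

print(answer_with_consistency_check("Which option is correct?"))
```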
arXiv Detail & Related papers (2024-12-12T16:04:31Z) - Explaining word embeddings with perfect fidelity: Case study in research impact prediction [0.0]
We propose Self-model Rated Entities (SMER), an explanation method for logistic regression-based classification models trained on word embeddings.
We show that SMER has theoretically perfect fidelity with the explained model, as its prediction corresponds exactly to the average of predictions for individual words in the text.
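The "perfect fidelity" claim rests on linearity: with logistic regression over an averaged word-embedding vector, the document's linear score equals the mean of the per-word scores exactly. The sketch below demonstrates that identity on synthetic data; it illustrates the principle only and is not the paper's code or dataset.

```python
# Linearity behind per-word explanations: for logistic regression over an
# averaged embedding, decision_function(mean of word vectors) equals the
# mean of decision_function over the individual word vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n_docs, doc_len = 16, 200, 12

word_vecs = rng.normal(size=(n_docs, doc_len, dim))   # toy "word embeddings"
doc_vecs = word_vecs.mean(axis=1)                     # averaged document vectors
labels = (doc_vecs[:, 0] > 0).astype(int)             # toy labels

clf = LogisticRegression(max_iter=1000).fit(doc_vecs, labels)

doc_score = clf.decision_function(doc_vecs[:1])[0]
word_scores = clf.decision_function(word_vecs[0])     # one linear score per word
print(np.isclose(doc_score, word_scores.mean()))      # True: exact per-word decomposition
```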
arXiv Detail & Related papers (2024-09-24T09:28:24Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
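A minimal sketch of the general idea (not the paper's estimator): sample several answer/explanation pairs and use the modal answer's share of the samples as a stability-based confidence score. `sample_model` is a placeholder for a real LLM call.

```python
# Stability-based confidence: repeated sampling, confidence = share of the modal answer.
from collections import Counter
import random

def sample_model(question: str) -> tuple[str, str]:
    """Placeholder for an LLM call returning (answer, explanation)."""
    answer = random.choice(["A", "A", "A", "B"])  # canned behaviour for the demo
    return answer, f"Explanation supporting option {answer}."

def stability_confidence(question: str, n_samples: int = 20) -> tuple[str, float]:
    answers = [sample_model(question)[0] for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

print(stability_confidence("Which option is correct?"))
```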
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection [10.301985230669684]
This paper presents a comprehensive analysis of GPT-4, GPT-3.5 Turbo, and FLAN-T5 models in detecting framing in news headlines.
We evaluated these models in various scenarios: zero-shot, few-shot with in-domain examples, cross-domain examples, and settings where models explain their predictions.
arXiv Detail & Related papers (2024-02-18T15:27:48Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - "You Are An Expert Linguistic Annotator": Limits of LLMs as Analyzers of Abstract Meaning Representation [60.863629647985526]
We examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning structure.
We find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure.
Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
arXiv Detail & Related papers (2023-10-26T21:47:59Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
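To make the token-to-AST alignment concrete, here is a rough sketch using Python's built-in ast module rather than the tooling described in the paper: each whitespace-delimited token is mapped to the deepest AST node whose source span covers it.

```python
# Rough sketch of aligning surface tokens with AST nodes (illustration only).
import ast

source = "total = price * quantity + tax"
tree = ast.parse(source)

def node_at_offset(tree: ast.AST, col: int) -> str:
    """Return the deepest AST node type whose column span covers `col`."""
    best = "Module"
    for node in ast.walk(tree):  # breadth-first, so deeper covering nodes come later
        if getattr(node, "col_offset", None) is not None and getattr(node, "end_col_offset", None) is not None:
            if node.col_offset <= col < node.end_col_offset:
                best = type(node).__name__
    return best

# Align whitespace-split "tokens" with the AST node that covers them.
col = 0
for token in source.split(" "):
    print(f"{token!r:>12} -> {node_at_offset(tree, col)}")
    col += len(token) + 1
```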
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that making reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reveals their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)