Related papers: The Limits of Goal-Setting Theory in LLM-Driven Assessment

The Limits of Goal-Setting Theory in LLM-Driven Assessment

URL: http://arxiv.org/abs/2510.06997v1
Date: Wed, 08 Oct 2025 13:20:40 GMT
Title: The Limits of Goal-Setting Theory in LLM-Driven Assessment
Authors: Mrityunjay Kumar,
Abstract summary: Many users interact with AI tools like ChatGPT using a mental model that treats the system as human-like, which we call Model H.<n>This paper tests that assumption through a controlled experiment in which ChatGPT evaluated 29 student submissions using four prompts with increasing specificity.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Many users interact with AI tools like ChatGPT using a mental model that treats the system as human-like, which we call Model H. According to goal-setting theory, increased specificity in goals should reduce performance variance. If Model H holds, then prompting a chatbot with more detailed instructions should lead to more consistent evaluation behavior. This paper tests that assumption through a controlled experiment in which ChatGPT evaluated 29 student submissions using four prompts with increasing specificity. We measured consistency using intra-rater reliability (Cohen's Kappa) across repeated runs. Contrary to expectations, performance did not improve consistently with increased prompt specificity, and performance variance remained largely unchanged. These findings challenge the assumption that LLMs behave like human evaluators and highlight the need for greater robustness and improved input integration in future model development.

Related papers

PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings.<n>PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes.<n>Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z)
Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind [8.740788873949471]
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks.<n>They still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed.
arXiv Detail & Related papers (2026-02-14T16:01:59Z)
Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction.<n>CANON based on entropy consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z)
UGCE: User-Guided Incremental Counterfactual Exploration [2.2789818122188925]
Counterfactual explanations (CFEs) are a popular approach for interpreting machine learning predictions by identifying minimal feature changes that alter model outputs.<n>Existing methods fail to support such iterative updates, instead recomputing explanations from scratch with each change, an inefficient and rigid approach.<n>We propose User-Guided Incremental Counterfactual Exploration (UGCE), a genetic algorithm-based framework that incrementally updates counterfactuals in response to evolving user constraints.
arXiv Detail & Related papers (2025-05-27T15:24:43Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Improving User Behavior Prediction: Leveraging Annotator Metadata in Supervised Machine Learning Models [20.680357762880163]
Supervised machine-learning models often underperform in predicting user behaviors from conversational text.<n>We introduce the Metadata- Weighted-Sensitive Ensemble Model (MSWEEM), which integrates annotator meta-features like fatigue and speeding.<n>MSWEEM outperforms standard ensembles by 14% on held-out data and 12% on an alternative dataset.
arXiv Detail & Related papers (2025-03-26T21:30:48Z)
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles [2.8839090723566296]
TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform. TurtleBench includes 1,532 user guesses along with the correctness of guesses after annotation. We thoroughly evaluated nine of the most advanced Large Language Models available today.
arXiv Detail & Related papers (2024-10-07T17:58:47Z)
Using ChatGPT to Score Essays and Short-Form Constructed Responses [0.0]
Investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments.
arXiv Detail & Related papers (2024-08-18T16:51:28Z)
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.