Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
- URL: http://arxiv.org/abs/2601.00624v1
- Date: Fri, 02 Jan 2026 09:43:52 GMT
- Title: Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
- Authors: Vadim Borisov, Michael Gröger, Mina Mikhael, Richard H. Schreiber,
- Abstract summary: YapBench is a benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores.
- Score: 1.6149401958316794
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores. YapBench contains over three hundred English prompts spanning three common brevity-ideal settings: (A) minimal or ambiguous inputs where the ideal behavior is a short clarification, (B) closed-form factual questions with short stable answers, and (C) one-line coding tasks where a single command or snippet suffices. Evaluating 76 assistant LLMs, we observe an order-of-magnitude spread in median excess length and distinct category-specific failure modes, including vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests. We release the benchmark and maintain a live leaderboard for tracking verbosity behavior over time.
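Below is a minimal sketch of how the metrics described in the abstract could be computed, assuming YapScore is the character-count excess over the curated baseline clipped at zero; the item field names and the exact definition are assumptions, not the released benchmark code.

```python
# Hedged sketch of YapScore / YapIndex as described in the abstract.
# Assumes YapScore = max(0, len(response) - len(baseline)) in characters;
# item fields ('id', 'category', 'baseline') are hypothetical names.
from collections import defaultdict
from statistics import median

def yap_score(response: str, baseline: str) -> int:
    """Excess response length beyond the minimal-sufficient baseline, in characters."""
    return max(0, len(response) - len(baseline))

def yap_index(items, responses):
    """Uniformly weighted average of category-level median YapScores.

    `items`: benchmark entries with 'id', 'category', and 'baseline' fields.
    `responses`: mapping from item id to the model's response text.
    """
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(
            yap_score(responses[item["id"]], item["baseline"])
        )
    medians = [median(scores) for scores in by_category.values()]
    return sum(medians) / len(medians)
```

Measuring excess length in characters, as the abstract notes, keeps the comparison independent of any particular tokenizer.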
Related papers
- ABCD: All Biases Come Disguised [4.603755953026689]
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels. We show that this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance by $3\times$ with only a minimal decrease in mean model performance.
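A rough, hedged illustration of the label-replacement idea, under the assumption that "uniform, unordered labels" means non-ordinal symbols with shuffled option order; the paper's exact protocol may differ.

```python
# Hedged sketch: replace ordinal MCQ labels (A/B/C/D) with non-ordinal symbols and
# shuffle option order so the model cannot lean on label ordering. The symbol set
# and the permutation scheme are assumptions, not the paper's protocol.
import random

UNORDERED_LABELS = ["@", "#", "%", "&"]  # hypothetical non-ordinal label set

def relabel_mcq(question: str, options: list, seed: int = 0) -> str:
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)
    lines = [question] + [f"({lab}) {opt}" for lab, opt in zip(UNORDERED_LABELS, shuffled)]
    return "\n".join(lines)
```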
arXiv Detail & Related papers (2026-02-19T15:12:33Z) - LikeBench: Evaluating Subjective Likability in LLMs for Personalization [11.75597537798083]
We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. We introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short
arXiv Detail & Related papers (2025-12-15T08:18:42Z) - Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs [55.827877498548965]
We propose a lightweight training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts.
arXiv Detail & Related papers (2025-11-28T15:22:52Z) - DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models [13.242009624334996]
DynaSpec is a dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. It delivers consistent improvements in mean accepted length; for Llama-3-8B it reaches up to 98.2% of full-vocabulary performance. By leveraging context-dependent selection, DynaSpec achieves up to a 2.18x increase in generated tokens, compared to 1.91x for fixed-vocabulary approaches.
arXiv Detail & Related papers (2025-10-11T19:38:07Z) - Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs [56.47577824219207]
We present a prompt-based strategy that compels an off-the-shelf large language model to generate exactly a desired number of tokens. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting". On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt.
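A minimal sketch of a countdown-style prompt in the spirit of this description; the marker format, wording, and word-level approximation of tokens are assumptions, not the paper's prompt.

```python
# Hedged sketch: decompose the target length into explicit countdown markers plus
# counting rules so the model "writes while counting". Exact wording is assumed.
def countdown_prompt(task: str, n_words: int) -> str:
    countdown = " ".join(f"[{i}]" for i in range(n_words, 0, -1))
    rules = (
        f"Write exactly {n_words} words. Before each word, copy the next countdown "
        "marker, counting down to [1], then stop immediately."
    )
    return f"{task}\n{rules}\nCountdown: {countdown}"

print(countdown_prompt("Describe the sky.", 5))
```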
arXiv Detail & Related papers (2025-08-19T13:12:01Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs [15.278241998033822]
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). We propose MinosEval, a novel evaluation method that first distinguishes factoid from non-factoid open-ended questions and then ranks candidate answers.
arXiv Detail & Related papers (2025-06-18T07:49:13Z) - Reducing the Scope of Language Models [7.464494269745494]
We show that it is possible to scope language models. We ablate the diversity of irrelevant queries, layer different techniques, and conduct adversarial evaluations. We intend our study to serve as a practitioner's guide to scoping language models.
arXiv Detail & Related papers (2024-10-28T23:06:57Z) - BYOC: Personalized Few-Shot Classification with Co-Authored Class Descriptions [2.076173115539025]
We propose a novel approach to few-shot text classification using an LLM.
Rather than few-shot examples, the LLM is prompted with descriptions of the salient features of each class.
Examples, questions, and answers are summarized to form the classification prompt.
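A small illustration of description-based prompting as summarized above; the class names, descriptions, and prompt wording are hypothetical.

```python
# Hedged sketch: the prompt carries a short description of each class's salient
# features instead of labeled few-shot examples. All class content is hypothetical.
CLASS_DESCRIPTIONS = {
    "billing": "Messages about invoices, charges, refunds, or payment methods.",
    "technical": "Messages reporting errors, crashes, or unexpected behavior.",
    "other": "Anything that does not fit the classes above.",
}

def build_classification_prompt(text: str) -> str:
    described = "\n".join(f"- {name}: {desc}" for name, desc in CLASS_DESCRIPTIONS.items())
    return (
        "Classify the message into exactly one class using the descriptions below.\n"
        f"{described}\n\nMessage: {text}\nClass:"
    )
```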
arXiv Detail & Related papers (2023-10-09T19:37:38Z) - Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
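One mitigation in this spirit is to judge each candidate pair in both presentation orders and average the verdicts; the sketch below assumes a generic pairwise judge callable and is not necessarily one of the paper's three strategies.

```python
# Hedged sketch: score each answer pair in both presentation orders and average,
# so any positional preference of the judge cancels out. `judge` is a hypothetical
# callable returning (score_for_first_slot, score_for_second_slot).
from typing import Callable, Tuple

def balanced_pairwise_score(
    judge: Callable[[str, str, str], Tuple[float, float]],
    question: str,
    answer_a: str,
    answer_b: str,
) -> Tuple[float, float]:
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```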
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
arXiv Detail & Related papers (2022-12-20T02:17:30Z)