Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
- URL: http://arxiv.org/abs/2510.12699v1
- Date: Tue, 14 Oct 2025 16:31:34 GMT
- Title: Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
- Authors: Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, Myra Cheng
- Abstract summary: We argue that effective generation space size (GSS) is the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics.
- Score: 30.476953783731307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models [91.21455683212224]
In large language models (LLMs), the probability of relevance for the next piece of information is linked to the probability of relevance for the next product. But whether fine-grained sampling probabilities faithfully align with task requirements remains an open question. We identify two model types: D-models, whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models, whose P_token is more stable and better aligned with P_task.
arXiv Detail & Related papers (2026-01-25T14:59:09Z)
- Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization [38.469173375694076]
This paper systematically analyzes the root causes of hallucinations in Multimodal Large Language Models (MLLMs). It identifies three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where NTK similarity causes false associations and unstable parameter updates. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
arXiv Detail & Related papers (2026-01-09T07:59:18Z)
- Self-Correcting Large Language Models: Generation vs. Multiple Choice [29.697851249014192]
Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement. We compare performance trends and error-correction behaviors across various natural language understanding and reasoning tasks. Our findings highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space.
arXiv Detail & Related papers (2025-11-12T14:46:40Z)
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) [90.45301024940329]
Language models (LMs) often struggle to generate diverse, human-like creative content. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries. We present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect.
arXiv Detail & Related papers (2025-10-27T03:16:21Z)
- SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards [55.99492656542475]
We propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z)
- ImpRAG: Retrieval-Augmented Generation with Implicit Queries [34.72864597562907]
ImpRAG is a query-free RAG system that integrates retrieval and generation into a unified model. We show that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats.
arXiv Detail & Related papers (2025-06-02T21:38:21Z)
- The Price of Format: Diversity Collapse in LLMs [32.616432249190716]
Large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that diversity collapse persists even under high-temperature sampling. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity.
arXiv Detail & Related papers (2025-05-25T02:52:35Z)
- Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
- Unifying Search and Recommendation with Dual-View Representation Learning in a Generative Paradigm [51.2624255871896]
GenSR is a novel generative paradigm for unifying search and recommendation. Our work introduces a new generative paradigm compared with previous discriminative methods.
arXiv Detail & Related papers (2025-04-09T09:15:37Z)
- Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space [14.715989394285238]
Existing Large Language Models (LLMs) do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates.
A new framework is proposed in this paper to address these issues.
Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space.
arXiv Detail & Related papers (2024-05-22T17:13:49Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.