Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
- URL: http://arxiv.org/abs/2603.02760v1
- Date: Tue, 03 Mar 2026 08:58:20 GMT
- Title: Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
- Authors: Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen
- Abstract summary: Diffusion large language models (dLLMs) have attracted significant attention for their ability to enhance diversity, controllability, and parallelism. We propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs.
- Score: 48.19579266939883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
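The abstract describes DiSE as scoring a generated sequence by the probability of regenerating its tokens given the full context. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a bidirectional model exposed as a hypothetical `logits_fn` that takes the full unmasked token sequence and returns one row of vocabulary logits per position, and it assumes the score is the mean log-probability over the generated positions. The abstract does not specify the exact normalization or input masking, so both choices are illustrative. The two toy models are stand-ins for a real dLLM forward pass.

```python
import math
from typing import Callable, List, Sequence

Logits = List[List[float]]  # one row of vocabulary logits per position


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def regeneration_confidence(
    prompt_ids: Sequence[int],
    generated_ids: Sequence[int],
    logits_fn: Callable[[List[int]], Logits],  # hypothetical model interface
) -> float:
    """Mean log-probability of regenerating each generated token.

    Assumption: the model sees the full sequence (prompt + generation) and
    the logits at position i score the token at position i (no shift, as in
    a bidirectional model).
    """
    if not generated_ids:
        raise ValueError("generated_ids must not be empty")
    full = list(prompt_ids) + list(generated_ids)
    logits = logits_fn(full)
    start = len(prompt_ids)
    total = 0.0
    for offset, token in enumerate(generated_ids):
        total += log_softmax(logits[start + offset])[token]
    return total / len(generated_ids)


VOCAB_SIZE = 4  # toy vocabulary, for illustration only


def uniform_model(ids: List[int]) -> Logits:
    """Toy model with no preference: every token is equally likely."""
    return [[0.0] * VOCAB_SIZE for _ in ids]


def peaked_model(ids: List[int]) -> Logits:
    """Toy model that strongly prefers the token already at each position."""
    return [[5.0 if v == t else 0.0 for v in range(VOCAB_SIZE)] for t in ids]


if __name__ == "__main__":
    prompt, answer = [0, 1], [2, 3]
    print("uniform:", regeneration_confidence(prompt, answer, uniform_model))
    print("peaked: ", regeneration_confidence(prompt, answer, peaked_model))
```

Under these assumptions a higher (less negative) score means the model is more willing to reproduce its own output. Such a score could be used to rank candidate generations or to decide whether to extend a sequence, in the spirit of the flexible-length framework the abstract mentions.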
Related papers
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation [22.921677603408188]
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation. VAUQ explicitly measures how strongly a model's output depends on visual evidence.
arXiv Detail & Related papers (2026-02-24T16:11:14Z)
- Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z)
- Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z)
- A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models [6.62851757612838]
Current confidence estimation methods for large language models (LLMs) neglect the relevance between responses and contextual information. We propose CRUX, which integrates context faithfulness and consistency for confidence estimation via two novel metrics. Experiments across three benchmark datasets demonstrate CRUX's effectiveness, achieving a higher AUROC than existing baselines.
arXiv Detail & Related papers (2025-08-01T12:58:34Z)
- Enhancing Uncertainty Estimation and Interpretability via Bayesian Non-negative Decision Layer [55.66973223528494]
We develop a Bayesian Non-negative Decision Layer (BNDL), which reformulates deep neural networks as a conditional Bayesian non-negative factor analysis. BNDL can model complex dependencies and provide robust uncertainty estimation. We also offer theoretical guarantees that BNDL can achieve effective disentangled learning.
arXiv Detail & Related papers (2025-05-28T10:23:34Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [75.1351701045874]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs). We propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way of improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
- Graph-based Confidence Calibration for Large Language Models [22.394717844099684]
We propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the large language models. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct.
arXiv Detail & Related papers (2024-11-03T20:36:44Z)
- Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.