QoNext: Towards Next-generation QoE for Foundation Models
- URL: http://arxiv.org/abs/2509.21889v2
- Date: Thu, 09 Oct 2025 13:06:14 GMT
- Title: QoNext: Towards Next-generation QoE for Foundation Models
- Authors: Yijin Guo, Zicheng Zhang, Ye Shen, Farong Wen, Junying Wang, Qi Jia, Guangtao Zhai,
- Abstract summary: Existing evaluations of foundation models fail to capture what truly matters: the user's experience during interaction. We introduce QoNext, the first framework that adapts Quality of Experience principles to the assessment of foundation models. We construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters.
- Score: 63.76972456980632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing evaluations of foundation models, including recent human-centric approaches, fail to capture what truly matters: the user's experience during interaction. Current methods treat evaluation as a matter of output correctness alone, overlooking that user satisfaction emerges from the interplay between response quality and interaction, which limits their ability to account for the mechanisms underlying user experience. To address this gap, we introduce QoNext, the first framework that adapts Quality of Experience (QoE) principles from networking and multimedia to the assessment of foundation models. QoNext identifies experiential factors that shape user experience and incorporates them into controlled experiments, where human ratings are collected under varied configurations. From these studies we construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters. Our results demonstrate that QoNext not only enables proactive and fine-grained evaluation but also provides actionable guidance for optimizing foundation models in productized services.
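The last step the abstract describes, estimating perceived experience from measurable system parameters, can be pictured as a small regression problem. The sketch below is illustrative only: the feature names (first-token latency, streaming throughput, answer length, an output-quality proxy), the synthetic ratings, and the gradient-boosting model are assumptions, not details taken from the paper.

```python
# Hypothetical QoE predictor: all features, data, and the model choice are
# assumptions for illustration, not the QoNext paper's actual setup.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500  # simulated chat sessions

# Hypothetical measurable system parameters per session.
first_token_latency = rng.uniform(0.1, 5.0, n)   # seconds
tokens_per_second = rng.uniform(5, 80, n)        # streaming throughput
answer_length = rng.integers(20, 800, n)         # tokens
task_accuracy = rng.uniform(0.0, 1.0, n)         # output-quality proxy

X = np.column_stack([first_token_latency, tokens_per_second,
                     answer_length, task_accuracy])

# Synthetic stand-in for human mean opinion scores (MOS) on a 1-5 scale:
# experience improves with accuracy and throughput, degrades with latency.
mos = np.clip(1 + 4 * task_accuracy
              - 0.3 * first_token_latency
              + 0.01 * tokens_per_second
              + rng.normal(0, 0.3, n), 1, 5)

model = GradientBoostingRegressor()
scores = cross_val_score(model, X, mos, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f}")
```

In a real QoNext-style pipeline the opinion scores would come from the controlled human studies rather than a simulator.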
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses; a toy sketch of such a graph follows this entry. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
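A preference graph over all response pairs can be represented with a plain dictionary. The minimal sketch below uses invented responses, preferences, and a hypothetical `judge_agreement` helper to show how pairwise edges yield a ranking and how a judge model can be scored against the graph.

```python
# Toy preference graph over four responses to one instruction.
# All preferences are made up for illustration.
from itertools import combinations

responses = ["A", "B", "C", "D"]

# preferred[(x, y)] is the winner of the pair (x, y).
preferred = {
    ("A", "B"): "A", ("A", "C"): "A", ("A", "D"): "A",
    ("B", "C"): "C", ("B", "D"): "B", ("C", "D"): "C",
}

# Copeland-style score: number of pairwise wins per response.
wins = {r: 0 for r in responses}
for winner in preferred.values():
    wins[winner] += 1

ranking = sorted(responses, key=wins.get, reverse=True)
print("ranking by pairwise wins:", ranking)  # ['A', 'C', 'B', 'D']

def judge_agreement(judge):
    """Fraction of the graph's pairwise preferences a judge reproduces."""
    hits = sum(judge(x, y) == preferred[(x, y)]
               for x, y in combinations(responses, 2))
    return hits / len(preferred)

# Trivial alphabetical "judge" for demonstration: agrees on 5 of 6 pairs.
print("agreement:", judge_agreement(lambda x, y: min(x, y)))
```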
- OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment [55.59322229889159]
We propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals. We use a reasoning-enhanced reward modeling dataset to form a reliable chain-of-thought dataset for supervised fine-tuning. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
arXiv Detail & Related papers (2025-10-12T13:46:28Z)
- ExGRPO: Learning to Reason from Experience [82.83309610498446]
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. In this paper, we are the first to investigate what makes a reasoning experience valuable, and we identify rollout correctness and entropy as effective indicators of experience value; a toy illustration of these two indicators follows this entry.
arXiv Detail & Related papers (2025-10-02T17:31:30Z)
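A minimal sketch of the two indicators named above, with invented rollouts and an arbitrary entropy threshold; ExGRPO's actual selection rule may differ.

```python
# Illustrative experience-value filter: rollout correctness plus mean token
# entropy. Rollouts, probabilities, and the threshold are made up.
import math

def mean_token_entropy(token_probs):
    """Average Shannon entropy (nats) of per-step token distributions."""
    total = 0.0
    for dist in token_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_probs)

def is_valuable(rollout, max_entropy=1.0):
    # Keep rollouts that solved the task and were generated confidently
    # (low entropy); such experiences are worth replaying in later updates.
    return (rollout["correct"]
            and mean_token_entropy(rollout["token_probs"]) < max_entropy)

replay_buffer = [
    {"correct": True,  "token_probs": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]},
    {"correct": True,  "token_probs": [[0.4, 0.3, 0.3]] * 4},  # high entropy
    {"correct": False, "token_probs": [[0.95, 0.03, 0.02]]},   # wrong answer
]
kept = [r for r in replay_buffer if is_valuable(r)]
print(f"replaying {len(kept)} of {len(replay_buffer)} rollouts")  # 1 of 3
```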
- EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation [17.37840331449749]
We propose a self-Evolving Pairwise Reasoning (EvolvR) framework for story evaluation. The framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. The evaluator trained on the refined data is deployed as a reward model to guide the story generation task.
arXiv Detail & Related papers (2025-08-08T06:10:47Z)
- Human-in-the-loop online just-in-time software defect prediction [6.35776510153759]
We propose Human-In-The-Loop (HITL) O-JIT-SDP, which integrates feedback from SQA staff to enhance the prediction process.
We also introduce a performance evaluation framework that combines a k-fold distributed bootstrap method with the Wilcoxon signed-rank test; a minimal sketch of such a paired comparison follows this entry.
These advancements hold the potential to significantly enhance the value of O-JIT-SDP for industrial applications.
arXiv Detail & Related papers (2023-08-25T23:40:08Z)
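A minimal sketch of the statistical comparison named above, assuming hypothetical per-fold scores for two defect predictors; the paper's distributed bootstrap is simplified here to one resampled estimate per fold.

```python
# Paired comparison of two models' per-fold scores with the Wilcoxon
# signed-rank test. All numbers are invented for illustration.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
k = 10  # folds

# Hypothetical per-fold performance estimates (e.g., G-mean or AUC).
baseline = rng.normal(0.70, 0.03, k)
hitl = baseline + rng.normal(0.03, 0.02, k)  # HITL variant, slightly better

stat, p = wilcoxon(hitl, baseline)
print(f"Wilcoxon statistic={stat:.1f}, p-value={p:.4f}")
if p < 0.05:
    print("HITL variant differs significantly from the baseline")
```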
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Justification of Recommender Systems Results: A Service-based Approach [4.640835690336653]
We propose a novel justification approach that uses service models to extract experience data from reviews concerning all the stages of interaction with items.
In a user study, we compared our approach with baselines reflecting the state of the art in the justification of recommender systems results.
Our models received higher Interface Adequacy and Satisfaction ratings from users with different levels of Curiosity or a low Need for Cognition (NfC).
These findings encourage the adoption of service models to justify recommender systems results but suggest the investigation of personalization strategies to suit diverse interaction needs.
arXiv Detail & Related papers (2022-11-07T11:08:19Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- Post-hoc Models for Performance Estimation of Machine Learning Inference [22.977047604404884]
Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that the proposed post-hoc models consistently outperform standard confidence baselines; a minimal sketch contrasting the two follows this entry.
arXiv Detail & Related papers (2021-10-06T02:20:37Z)
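The contrast between a raw-confidence baseline and a post-hoc performance model can be sketched on synthetic data; everything below (dataset, logistic models, the `posthoc_features` helper) is an assumption for illustration, not the paper's setup.

```python
# Compare raw confidence against a post-hoc model trained to predict whether
# the base classifier is correct. Synthetic data for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_meta, X_test, y_meta, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def posthoc_features(model, X):
    # The base model's own confidence, concatenated with the raw inputs.
    conf = model.predict_proba(X).max(axis=1, keepdims=True)
    return np.hstack([conf, X])

# Target for the post-hoc model: did the base model get each point right?
correct_meta = (base.predict(X_meta) == y_meta).astype(int)
posthoc = LogisticRegression(max_iter=1000).fit(
    posthoc_features(base, X_meta), correct_meta)

correct_test = (base.predict(X_test) == y_test).astype(int)
conf_baseline = base.predict_proba(X_test).max(axis=1)
posthoc_score = posthoc.predict_proba(posthoc_features(base, X_test))[:, 1]

# AUC for detecting the base model's errors: higher is better.
print("confidence baseline AUC:", roc_auc_score(correct_test, conf_baseline))
print("post-hoc model AUC:     ", roc_auc_score(correct_test, posthoc_score))
```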
- Study on the Assessment of the Quality of Experience of Streaming Video [117.44028458220427]
In this paper, the influence of various objective factors on the subjective estimation of the QoE of streaming video is studied.
The paper presents standard and handcrafted features and reports each feature's correlation with subjective scores together with its significance p-value; a small sketch of such an analysis follows this entry.
We use the SQoE-III database, so far the largest and most realistic of its kind.
arXiv Detail & Related papers (2020-12-08T18:46:09Z)
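A small sketch of the correlation-plus-p-value analysis described above, using invented stand-ins for SQoE-III features and mean opinion scores.

```python
# Pearson correlation and p-value between each objective streaming feature
# and subjective MOS. All arrays are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n = 60  # streaming sessions

features = {
    "bitrate_kbps": rng.uniform(300, 4000, n),
    "rebuffer_ratio": rng.uniform(0.0, 0.3, n),
    "startup_delay_s": rng.uniform(0.2, 8.0, n),
}
# Synthetic MOS loosely driven by bitrate and rebuffering.
mos = (1 + 3.5 * (features["bitrate_kbps"] / 4000)
       - 6.0 * features["rebuffer_ratio"]
       + rng.normal(0, 0.3, n)).clip(1, 5)

for name, values in features.items():
    r, p = pearsonr(values, mos)
    print(f"{name:>16}: r={r:+.2f}, p={p:.3g}")
```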
This list is automatically generated from the titles and abstracts of the papers on this site.