InFerActive: Towards Scalable Human Evaluation of Large Language Models through Interactive Inference
- URL: http://arxiv.org/abs/2512.10234v1
- Date: Thu, 11 Dec 2025 02:41:14 GMT
- Title: InFerActive: Towards Scalable Human Evaluation of Large Language Models through Interactive Inference
- Authors: Junhyeong Hwangbo, Soohyun Lee, Minsoo Cheong, Hyeon Jeon, Jinwook Seo
- Abstract summary: We present InFerActive, an interactive inference system for scalable human evaluation. We demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior.
- Score: 14.903507875179033
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human evaluation remains the gold standard for evaluating outputs of Large Language Models (LLMs). The current evaluation paradigm reviews numerous individual responses, leading to significant scalability challenges. LLM outputs can be more efficiently represented as a tree structure, reflecting their autoregressive generation process and stochastic token selection. However, conventional tree visualization cannot scale to the exponentially large trees generated by modern sampling methods of LLMs. To address this problem, we present InFerActive, an interactive inference system for scalable human evaluation. InFerActive enables on-demand exploration through probability-based filtering and evaluation features, while bridging the semantic gap between computational tokens and human-readable text through adaptive visualization techniques. Through a technical evaluation and user study (N=12), we demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior. We further conduct expert case studies that demonstrate InFerActive's practical applicability and potential for transforming LLM evaluation workflows.
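The abstract describes representing sampled LLM outputs as a token tree that reflects autoregressive generation, then pruning it with probability-based filtering for on-demand exploration. The paper's implementation is not reproduced here, so the following is only a minimal illustrative sketch of that general idea under stated assumptions: `TokenNode`, `expand`, `toy_dist`, and the `min_path_prob` threshold are hypothetical names, and a toy distribution stands in for a real model's next-token probabilities.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch only: not InFerActive's published implementation.
# It shows the general idea of representing sampled LLM continuations as a
# probability-annotated token tree and filtering low-probability branches.

@dataclass
class TokenNode:
    token: str
    prob: float                # conditional probability P(token | prefix)
    path_prob: float           # cumulative probability of the whole prefix
    children: list["TokenNode"] = field(default_factory=list)

def expand(node: TokenNode,
           next_token_dist: Callable[[list[str]], dict[str, float]],
           prefix: list[str],
           depth: int,
           min_path_prob: float = 1e-3) -> None:
    """Grow the tree under `node`, skipping branches whose cumulative
    probability drops below `min_path_prob` (probability-based filtering)."""
    if depth == 0:
        return
    for tok, p in next_token_dist(prefix + [node.token]).items():
        path_p = node.path_prob * p
        if path_p < min_path_prob:
            continue  # filtered: too unlikely to display by default
        child = TokenNode(token=tok, prob=p, path_prob=path_p)
        node.children.append(child)
        expand(child, next_token_dist, prefix + [node.token], depth - 1, min_path_prob)

# Toy stand-in for a real model's next-token distribution (hypothetical).
def toy_dist(prefix: list[str]) -> dict[str, float]:
    return {"yes": 0.6, "no": 0.3, "maybe": 0.1}

root = TokenNode(token="<s>", prob=1.0, path_prob=1.0)
expand(root, toy_dist, prefix=[], depth=3)
```

In an interactive system of the kind the abstract describes, the expansion depth and probability threshold would be adjusted on demand rather than fixed, which is what keeps an exponentially large sampling tree explorable.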
Related papers
- Objective Metrics for Evaluating Large Language Models Using External Data Sources [4.574672973076743]
This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
arXiv Detail & Related papers (2025-08-01T02:24:19Z) - Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment. We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation (a generic sketch of this selection pattern appears after this list).
arXiv Detail & Related papers (2025-05-26T16:39:41Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Maximizing Signal in Human-Model Preference Alignment [0.0]
This paper argues that in cases in which end users need to agree with the decisions made by ML models, models should be trained and evaluated on data that represent their preferences. We show that noise from labeling disagreement can be minimized by adhering to proven methodological best practices.
arXiv Detail & Related papers (2025-03-06T19:10:57Z) - PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical yet challenging to evaluate. Manual evaluation, while comprehensive, is often costly and resource-intensive. Automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z) - Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models [0.0]
This study explores the effectiveness of various in-context learning strategies in language models (LMs) across benchmark datasets.
We employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore.
Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced.
arXiv Detail & Related papers (2024-10-15T09:19:42Z) - Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis [22.755345889167934]
We present an interactive visualization system that enables exploration of large language models (LLMs) through counterfactual analysis. Our system features a novel algorithm that generates fluent and semantically meaningful counterfactuals. A user study with LLM practitioners and interviews with experts demonstrate the system's usability and effectiveness.
arXiv Detail & Related papers (2024-04-23T19:57:03Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
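The Monocle entry above mentions an uncertainty-based active learning step that routes the most ambiguous samples to human annotators. The following is only a generic sketch of that selection pattern, not the paper's published algorithm; `predictive_entropy`, `select_for_annotation`, and the toy scores are illustrative assumptions. It ranks items by the entropy of repeated automatic judgments and forwards the most uncertain ones for human labeling.

```python
import math

# Generic active-learning sketch (not Monocle's published algorithm): route the
# items whose automatic scores disagree the most across repeated evaluations
# to human annotators first.

def predictive_entropy(score_samples: list[int], num_classes: int) -> float:
    """Entropy of the empirical distribution of repeated automatic scores."""
    counts = [0] * num_classes
    for s in score_samples:
        counts[s] += 1
    total = len(score_samples)
    ent = 0.0
    for c in counts:
        if c:
            p = c / total
            ent -= p * math.log(p)
    return ent

def select_for_annotation(items: dict[str, list[int]],
                          num_classes: int,
                          budget: int) -> list[str]:
    """Return the `budget` item ids with the highest score uncertainty."""
    ranked = sorted(items,
                    key=lambda k: predictive_entropy(items[k], num_classes),
                    reverse=True)
    return ranked[:budget]

# Hypothetical example: three outputs, each scored five times on a 0-4 scale.
scores = {"out_a": [4, 4, 4, 4, 4],
          "out_b": [1, 3, 0, 4, 2],
          "out_c": [2, 2, 3, 2, 2]}
print(select_for_annotation(scores, num_classes=5, budget=1))  # -> ['out_b']
```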