Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
- URL: http://arxiv.org/abs/2510.06242v1
- Date: Fri, 03 Oct 2025 08:37:33 GMT
- Title: Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
- Authors: Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer,
- Abstract summary: Open-ended survey responses provide valuable insights in marketing research. Low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions. We propose a two-stage evaluation framework specifically designed for human survey responses.
- Score: 7.295969279816647
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address these characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability in real-world settings such as response quality prediction and response rejection, showing strong correlations with expert assessment.
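The abstract outlines a two-stage pipeline: a gibberish filter followed by LLM-based scoring along effort, relevance, and completeness. Below is a minimal sketch of that structure in Python; the filtering heuristics, rubric prompt, scoring scale, and function names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a two-stage survey-response evaluation pipeline.
# Heuristics, prompt, and scale are assumptions, not the paper's method.
import json
import re
from typing import Callable

def is_gibberish(response: str) -> bool:
    """Stage 1: cheap heuristic filter for nonsensical responses."""
    text = response.strip()
    if len(text) < 3:
        return True
    # Reject strings dominated by non-alphabetic characters (e.g. "asdf!!!111").
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    if alpha_ratio < 0.6:
        return True
    # Reject keyboard-mash style runs with no vowels at all.
    if not re.search(r"[aeiouAEIOU]", text):
        return True
    return False

RUBRIC_PROMPT = """You are grading an open-ended survey response.
Question: {question}
Response: {response}
Rate each dimension from 1 (poor) to 5 (excellent) and reply as JSON:
{{"effort": <int>, "relevance": <int>, "completeness": <int>}}"""

def evaluate_response(question: str, response: str,
                      llm: Callable[[str], str]) -> dict:
    """Stage 2: LLM-based scoring on effort, relevance, and completeness."""
    if is_gibberish(response):
        return {"filtered": True, "effort": 0, "relevance": 0, "completeness": 0}
    raw = llm(RUBRIC_PROMPT.format(question=question, response=response))
    scores = json.loads(raw)
    return {"filtered": False, **scores}

if __name__ == "__main__":
    # Stubbed LLM call so the sketch runs without an API.
    stub = lambda prompt: '{"effort": 4, "relevance": 5, "completeness": 3}'
    print(evaluate_response("What do you like about the product?",
                            "The battery lasts all day and it charges fast.", stub))
```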
Related papers
- DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z) - Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review [53.99984738447279]
Recent work frames this task as automatic text generation, underusing author expertise and intent. We introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement. To support this formulation, we construct Re$3$Align, the first large-scale dataset of aligned review-response-revision triplets.
arXiv Detail & Related papers (2026-01-19T14:07:10Z) - DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey [53.85391477976017]
DeepSurvey-Bench is a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. We construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys.
arXiv Detail & Related papers (2026-01-13T14:42:56Z) - Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation [46.697788643450785]
Large language models (LLMs) have been found to produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies.
arXiv Detail & Related papers (2025-10-09T08:22:24Z) - Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting [0.7699714865575188]
Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. (A small illustrative computation of these agreement statistics appears after this list.)
arXiv Detail & Related papers (2025-09-25T10:26:23Z) - Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z) - EvalAgent: Discovering Implicit Evaluation Criteria from the Web [82.82096383262068]
We introduce EvalAgent, a framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent mines expert-authored online guidance to propose diverse, long-tail evaluation criteria. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit, yet specific.
arXiv Detail & Related papers (2025-04-21T16:43:50Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Self-Improving Customer Review Response Generation Based on LLMs [1.9274286238176854]
SCRABLE is an adaptive customer review response automation system that enhances itself with self-optimizing prompts.
We introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains.
arXiv Detail & Related papers (2024-05-06T20:50:17Z) - Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric [5.592917884093537]
Quality estimation (QE) metrics are introduced and evaluated as a novel tool for enhancing explainable artificial intelligence (XAI) in ASR systems.
The capabilities of the NoRefER metric are explored in identifying word-level errors to aid post-editors in refining ASR hypotheses.
arXiv Detail & Related papers (2024-01-20T16:48:55Z)
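As referenced in the LLM-as-a-Judge entry above, agreement between automatic and human grades is often summarized by a median absolute deviation and a root mean square deviation. The snippet below is a small illustrative computation of those two statistics; the scores and the exact deviation definitions are assumptions for demonstration, not data from that paper.

```python
# Illustrative agreement statistics between automatic and human scores;
# the example grades are made up.
import math
import statistics

def median_absolute_deviation(auto: list[float], human: list[float]) -> float:
    """Median of |automatic score - human score| over all graded answers."""
    return statistics.median(abs(a - h) for a, h in zip(auto, human))

def rmsd(auto: list[float], human: list[float]) -> float:
    """Root mean square deviation between automatic and human scores."""
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(auto, human)) / len(auto))

if __name__ == "__main__":
    human_scores = [8.0, 6.5, 9.0, 4.0, 7.5]  # hypothetical expert grades
    auto_scores = [7.5, 7.0, 9.5, 5.5, 7.0]   # hypothetical LLM-judge grades
    print("MAD :", median_absolute_deviation(auto_scores, human_scores))
    print("RMSD:", rmsd(auto_scores, human_scores))
```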