Human-Centric Evaluation for Foundation Models
- URL: http://arxiv.org/abs/2506.01793v1
- Date: Mon, 02 Jun 2025 15:33:29 GMT
- Title: Human-Centric Evaluation for Foundation Models
- Authors: Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
- Abstract summary: We propose a Human-Centric subjective Evaluation framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. We conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind.
- Score: 31.400215906308546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
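The abstract reports over 540 participant-driven evaluations scored along three subjective dimensions but does not describe how ratings are aggregated. As a rough, non-authoritative sketch, the Python snippet below averages participant ratings per model and per dimension; the file name hce_ratings.csv, its column layout (model, dimension, rating), and the rating scale are illustrative assumptions, not part of the released dataset.

```python
# Minimal sketch (not the authors' code): aggregate participant ratings into
# per-dimension mean scores for the three HCE dimensions. The CSV layout
# (columns: model, dimension, rating) and file name are hypothetical.
import csv
from collections import defaultdict

DIMENSIONS = {"problem_solving", "information_quality", "interaction_experience"}

def mean_dimension_scores(path: str) -> dict[str, dict[str, float]]:
    """Return {model: {dimension: mean rating}} from participant evaluations."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            model, dim = row["model"], row["dimension"]
            if dim not in DIMENSIONS:
                continue  # skip free-text feedback or unrelated columns
            sums[model][dim] += float(row["rating"])
            counts[model][dim] += 1
    return {m: {d: sums[m][d] / counts[m][d] for d in sums[m]} for m in sums}

if __name__ == "__main__":
    # hce_ratings.csv is an assumed export of the participant evaluations.
    scores = mean_dimension_scores("hce_ratings.csv")
    for model, dims in sorted(scores.items()):
        overall = sum(dims.values()) / len(dims)
        print(f"{model}: overall={overall:.2f}", dims)
```

Ranking models by the unweighted mean of the three dimension scores is one simple design choice; the paper may weight dimensions or disciplines differently.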
Related papers
- Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text [51.149562188883486]
We introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both humans and GPT. Based on this corpus, we propose Data Selection and Balance and Mix-SFT training methods, and apply DPO to develop Minos.
arXiv Detail & Related papers (2025-06-03T06:17:16Z) - Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks [3.1976901430982063]
We propose a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity.
arXiv Detail & Related papers (2025-05-29T01:47:43Z) - From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation [36.40760924116748]
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA). Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images. We propose an Unsupervised Peer review MLLM Evaluation framework, which allows models to automatically generate questions and conduct peer review assessments of answers from other models.
arXiv Detail & Related papers (2025-03-19T07:15:41Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond Algorithms [31.2529724533643]
This work presents the first comprehensive benchmarking study from three under-explored perspectives beyond algorithms.
An analysis of 31 datasets reveals the distinct impacts of data samples.
We achieve a PA-MPJPE of 47.3 mm on the 3DPW test set with a relatively simple model.
arXiv Detail & Related papers (2022-09-21T17:39:53Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach [0.0]
This article presents our unimodal privacy-safe and non-individual proposal for the audio-video group emotion recognition subtask at the Emotion Recognition in the Wild (EmotiW) Challenge 2020.
arXiv Detail & Related papers (2020-09-15T12:25:33Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)