ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
- URL: http://arxiv.org/abs/2505.16667v1
- Date: Thu, 22 May 2025 13:32:39 GMT
- Title: ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
- Authors: Xinwei Yang, Zhaofeng Liu, Chen Huang, Jiashuai Zhang, Tong Zhang, Yifan Zhang, Wenqiang Lei
- Abstract summary: We present the first taxonomy of human feedback consolidating the entire programming process. We also introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming.
- Score: 23.731654134407894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
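The benchmark's interface is not specified in this abstract, but the workflow it describes (an LLM drafting solutions and revising them in response to simulated human feedback) can be sketched as a simple evaluation loop. The sketch below is purely illustrative: the Problem dataclass, the passes_all_tests checker, and the evaluate_with_feedback driver are hypothetical names and are not part of the ELABORATION codebase.

```python
# Hypothetical sketch of a simulated human-feedback loop for human-LLM
# competitive programming. Names and interfaces are illustrative only and
# are NOT taken from the ELABORATION repository.
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    statement: str
    tests: List[Tuple[str, str]]  # (stdin, expected stdout) pairs

def passes_all_tests(code: str, problem: Problem) -> bool:
    """Naive checker: exec the candidate as a script with a fake input()
    and compare captured stdout (a real benchmark would sandbox this)."""
    for stdin_text, expected in problem.tests:
        lines = iter(stdin_text.splitlines())
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {"__name__": "__main__", "input": lambda prompt="": next(lines)})
        except Exception:
            return False
        if buf.getvalue().strip() != expected.strip():
            return False
    return True

def evaluate_with_feedback(
    problem: Problem,
    llm_solve: Callable[[str, List[str]], str],        # (statement, feedback history) -> code
    simulate_feedback: Callable[[str, Problem], str],  # (failing code, problem) -> feedback
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Alternate LLM attempts with simulated human feedback; report whether
    a passing solution was found and after how many rounds."""
    history: List[str] = []
    for round_id in range(1, max_rounds + 1):
        code = llm_solve(problem.statement, history)
        if passes_all_tests(code, problem):
            return True, round_id
        # Feedback could be drawn from a fine-grained taxonomy, e.g.
        # clarifying the statement, pointing at a failing test, or
        # suggesting an algorithmic fix.
        history.append(simulate_feedback(code, problem))
    return False, max_rounds
```

Under this kind of loop, comparing collaboration methods reduces to swapping in different simulate_feedback strategies (for instance, different feedback types from a taxonomy) and tracking pass rate against the number of interaction rounds.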
Related papers
- Human-Centric Evaluation for Foundation Models [31.400215906308546]
We propose a Human-Centric subjective Evaluation framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. We conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind.
arXiv Detail & Related papers (2025-06-02T15:33:29Z) - Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment. We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
arXiv Detail & Related papers (2025-05-26T16:39:41Z) - Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator [26.89750429841565]
Creativity evaluation remains a challenging frontier for large language models (LLMs). We propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve consistency. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments.
arXiv Detail & Related papers (2025-05-25T17:25:23Z) - Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models [46.09562860220433]
We introduce GazeReward, a novel framework that integrates implicit feedback, specifically eye-tracking (ET) data, into the Reward Model (RM). Our approach significantly improves the accuracy of the RM on established human preference datasets.
arXiv Detail & Related papers (2024-10-02T13:24:56Z) - Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human in the Loop [44.51779041553597]
Large Language Models have found application in mundane and repetitive tasks including Human Resource (HR) support.
We developed an HR support chatbot as an efficient and effective tool for addressing employee inquiries.
Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data.
Through expert analysis, we infer that reference-free evaluation metrics such as G-Eval demonstrate reliability closely aligned with that of human evaluation.
arXiv Detail & Related papers (2024-07-08T13:32:14Z) - Large Language Model-based Human-Agent Collaboration for Complex Task Solving [94.3914058341565]
We introduce the problem of Large Language Model (LLM)-based human-agent collaboration for complex task-solving.
We propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC.
This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process.
arXiv Detail & Related papers (2024-02-20T11:03:36Z) - Learning to Complement with Multiple Humans [21.247853435529446]
This paper introduces the innovative Learning to Complement with Multiple Humans (LECOMH) approach.
LECOMH is designed to learn from noisy labels without depending on clean labels, simultaneously maximising collaborative accuracy.
New benchmarks featuring multiple noisy labels for both training and testing are proposed to evaluate HAI-CC methods.
arXiv Detail & Related papers (2023-11-22T05:31:06Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of these alignment technologies.
arXiv Detail & Related papers (2023-07-24T17:44:58Z) - A Survey of Human-in-the-loop for Machine Learning [7.056132067948671]
Human-in-the-loop aims to train an accurate prediction model with minimum cost by integrating human knowledge and experience.
This survey intends to provide a high-level summary of human-in-the-loop methods and to motivate interested readers to consider approaches for designing effective human-in-the-loop solutions.
arXiv Detail & Related papers (2021-08-02T14:42:28Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration [72.88493072196094]
We present a new synthesis approach that leverages learning to guide a bottom-up search over programs.
In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a set of input-output examples.
We show that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches (a toy sketch of this search loop appears after this list).
arXiv Detail & Related papers (2020-07-28T17:46:18Z) - Human Trajectory Forecasting in Crowds: A Deep Learning Perspective [89.4600982169]
We present an in-depth analysis of existing deep learning-based methods for modelling social interactions.
We propose two knowledge-based data-driven methods to effectively capture these social interactions.
We develop a large-scale, interaction-centric benchmark, TrajNet++, a significant yet missing component in the field of human trajectory forecasting.
arXiv Detail & Related papers (2020-07-07T17:19:56Z)
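As referenced in the BUSTLE entry above, its bottom-up, learning-guided search lends itself to a toy sketch: programs over a tiny string DSL are enumerated from the inputs upward, and a scoring function, standing in for the learned ranker conditioned on input-output examples, decides which intermediate values survive each round. Everything below (the two-operation DSL, the score heuristic, and all names) is invented for illustration and is not taken from the BUSTLE paper or its code.

```python
# Toy illustration of learning-guided bottom-up synthesis in the spirit of
# BUSTLE. The two-operation string DSL, the score() heuristic standing in
# for the learned ranker, and all names are invented for this sketch.
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (input string, expected output string)

# A deliberately tiny DSL: every operation takes two strings (the second
# argument of left3 is simply ignored) so the search loop stays uniform.
OPS: Dict[str, Callable[[str, str], str]] = {
    "concat": lambda a, b: a + b,
    "left3": lambda a, _b: a[:3],
}

def score(values: Tuple[str, ...], examples: List[Example]) -> int:
    """Stand-in for the learned model: BUSTLE trains a network to rank how
    promising an intermediate value is given the I/O examples. Here we use
    a trivial heuristic (is the value a prefix of the expected output?)."""
    return sum(out.startswith(v) for v, (_, out) in zip(values, examples))

def synthesize(examples: List[Example], rounds: int = 3, beam: int = 50) -> Optional[str]:
    """Bottom-up search: grow a pool of (expression, value-per-example)
    pairs, keeping only the highest-scoring candidates after each round."""
    inputs = tuple(inp for inp, _ in examples)
    targets = tuple(out for _, out in examples)
    pool: List[Tuple[str, Tuple[str, ...]]] = [("x", inputs)]
    for _ in range(rounds):
        new = []
        for (expr_a, vals_a), (expr_b, vals_b) in product(pool, repeat=2):
            for name, fn in OPS.items():
                vals = tuple(fn(a, b) for a, b in zip(vals_a, vals_b))
                if vals == targets:
                    return f"{name}({expr_a}, {expr_b})"
                new.append((f"{name}({expr_a}, {expr_b})", vals))
        # Learning-guided pruning: keep only the most promising candidates.
        new.sort(key=lambda item: score(item[1], examples), reverse=True)
        pool.extend(new[:beam])
    return None

if __name__ == "__main__":
    # Target behaviour: prepend the first three characters of the input.
    print(synthesize([("abcdef", "abcabcdef"), ("hi", "hihi")]))
```

Running the script prints concat(left3(x, x), x), the expression that prepends the first three characters of the input; the toy search recovers it in two rounds because the prefix-overlap heuristic keeps left3(x, x) near the top of the pool.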