Eliciting Informative Text Evaluations with Large Language Models
- URL: http://arxiv.org/abs/2405.15077v4
- Date: Mon, 2 Sep 2024 20:25:36 GMT
- Title: Eliciting Informative Text Evaluations with Large Language Models
- Authors: Yuxuan Lu, Shengwei Xu, Yichi Zhang, Yuqing Kong, Grant Schoenebeck
- Abstract summary: We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM).
We show that our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium.
We highlight that on the ICLR dataset, our mechanisms can differentiate three quality levels in terms of expected scores: human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews.
- Score: 14.176332393753906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms, as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. Notably, on the ICLR dataset, our mechanisms differentiate three quality levels in terms of expected scores: human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
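To make the "LLM as predictor" idea concrete, below is a minimal sketch of a GPPM-style payment. It assumes a scoring function that returns an LLM-estimated log-probability of a target text given a context; that helper, the prompt wording, and the pointwise-mutual-information-style difference are illustrative assumptions, not the paper's exact estimator or payment rule.

```python
# Illustrative sketch of a GPPM-style payment (assumptions noted above).
from typing import Callable

def gppm_score(
    agent_report: str,
    peer_report: str,
    item_description: str,
    logprob_fn: Callable[[str, str], float],  # returns log P(target | context) under an LLM
) -> float:
    """Pay agent i by how much her report improves an LLM's prediction of peer j's report."""
    # Baseline: the LLM predicts the peer's report from the item (paper/product) alone.
    baseline = logprob_fn(peer_report, item_description)
    # Informed: the LLM predicts the peer's report after also seeing agent i's report.
    informed = logprob_fn(
        peer_report,
        item_description + "\n\nAnother reviewer wrote:\n" + agent_report,
    )
    # A positive score means agent i's report carried information about the peer's report,
    # which is what rewards effort and truth-telling in expectation.
    return informed - baseline
```

The actual mechanisms (and GSPPM's use of a synopsis of the peer's report rather than the raw text) involve more careful probability estimation; the sketch only illustrates the "report to prediction of the peer's report" mapping described in the abstract.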
Related papers
- Deep Transfer Learning Based Peer Review Aggregation and Meta-review Generation for Scientific Articles [2.0778556166772986]
We address two peer review aggregation challenges: paper acceptance decision-making and meta-review generation.
First, we propose to automate acceptance decision prediction by applying traditional machine learning algorithms.
Second, for meta-review generation, we propose a transfer learning model based on T5.
arXiv Detail & Related papers (2024-10-05T15:40:37Z)
- Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- AgentReview: Exploring Peer Review Dynamics with LLM Agents [13.826819101545926]
We introduce AgentReview, the first large language model (LLM)-based peer review simulation framework.
Our study reveals significant insights, including a notable 37.1% variation in paper decisions due to reviewers' biases.
arXiv Detail & Related papers (2024-06-18T15:22:12Z)
- Rumour Evaluation with Very Large Language Models [2.6861033447765217]
This work proposes to leverage the advancement of prompting-dependent large language models to combat misinformation.
We employ two prompting-based LLM variants to extend the two RumourEval subtasks.
For veracity prediction, we experiment with three classification schemes per GPT variant, each tested in zero-, one-, and few-shot settings.
For stance classification, prompting-based approaches show performance comparable to prior results, with no improvement over fine-tuning methods.
arXiv Detail & Related papers (2024-04-11T19:38:22Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
- Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models [51.3422222472898]
We document the capability of large language models (LLMs) like ChatGPT to predict stock price movements using news headlines.
We develop a theoretical model incorporating information capacity constraints, underreaction, limits-to-arbitrage, and LLMs.
arXiv Detail & Related papers (2023-04-15T19:22:37Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- Test-time Collective Prediction [73.74982509510961]
In many machine learning settings, multiple parties want to jointly make predictions on future test points.
Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters.
We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)
- Unsupervised Explanation Generation for Machine Reading Comprehension [36.182335120466895]
We propose a self-explainable framework for the machine reading comprehension task.
The proposed system aims to use less passage information while achieving results similar to a system that uses the whole passage.
To evaluate explainability, we compared our approach with a traditional attention mechanism in human evaluations and found that the proposed system has a notable advantage over the latter.
arXiv Detail & Related papers (2020-11-13T02:58:55Z)