Estimating Contribution Quality in Online Deliberations Using a Large Language Model
- URL: http://arxiv.org/abs/2408.11936v1
- Date: Wed, 21 Aug 2024 18:41:32 GMT
- Title: Estimating Contribution Quality in Online Deliberations Using a Large Language Model
- Authors: Lodewijk Gelauff, Mohak Goyal, Bhargav Dindukurthi, Ashish Goel, Alice Siu,
- Abstract summary: We use a large language model (LLM) alongside eight human annotators to rate contributions based on justification, novelty, expansion of the conversation, and potential for further expansion.
Using the average rating from other human annotators as the ground truth, we find the model outperforms individual human annotators.
We illustrate the usefulness of the automated quality rating by assessing the effect of nudges on the quality of deliberation.
- Score: 4.911986505938227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deliberation involves participants exchanging knowledge, arguments, and perspectives and has been shown to be effective at addressing polarization. The Stanford Online Deliberation Platform facilitates large-scale deliberations. It enables video-based online discussions on a structured agenda for small groups without requiring human moderators. This paper's data comes from various deliberation events, including one conducted in collaboration with Meta in 32 countries, and another with 38 post-secondary institutions in the US. Estimating the quality of contributions in a conversation is crucial for assessing feature and intervention impacts. Traditionally, this is done by human annotators, which is time-consuming and costly. We use a large language model (LLM) alongside eight human annotators to rate contributions based on justification, novelty, expansion of the conversation, and potential for further expansion, with scores ranging from 1 to 5. Annotators also provide brief justifications for their ratings. Using the average rating from other human annotators as the ground truth, we find the model outperforms individual human annotators. While pairs of human annotators outperform the model in rating justification and groups of three outperform it on all four metrics, the model remains competitive. We illustrate the usefulness of the automated quality rating by assessing the effect of nudges on the quality of deliberation. We first observe that individual nudges after prolonged inactivity are highly effective, increasing the likelihood of the individual requesting to speak in the next 30 seconds by 65%. Using our automated quality estimation, we show that the quality ratings for statements prompted by nudging are similar to those made without nudging, signifying that nudging leads to more ideas being generated in the conversation without losing overall quality.
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - Measuring the Effect of Influential Messages on Varying Personas [67.1149173905004]
We present a new task, Response Forecasting on Personas for News Media, to estimate the response a persona might have upon seeing a news message.
The proposed task not only introduces personalization in the modeling but also predicts the sentiment polarity and intensity of each response.
This enables more accurate and comprehensive inference on the mental state of the persona.
arXiv Detail & Related papers (2023-05-25T21:01:00Z) - Prediction of Listener Perception of Argumentative Speech in a
Crowdsourced Data Using (Psycho-)Linguistic and Fluency Features [24.14001104126045]
We aim to predict TED talk-style affective ratings in a crowdsourced dataset of argumentative speech.
We present an effective approach to the classification task of predicting these categories through fine-tuning a model pre-trained on a large dataset of TED talks public speeches.
arXiv Detail & Related papers (2021-11-13T15:07:13Z) - I Beg to Differ: A study of constructive disagreement in online
conversations [15.581515781839656]
We construct a corpus of 7 425 Wikipedia Talk page conversations that contain content disputes.
We define the task of predicting whether disagreements will be escalated to mediation by a moderator.
We develop a variety of neural models and show that taking into account the structure of the conversation improves predictive accuracy.
arXiv Detail & Related papers (2021-01-26T16:36:43Z) - Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline with a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z) - Large Scale Multi-Actor Generative Dialog Modeling [22.286624163849893]
We introduce the Generative Conversation Control model, a language model that conditions on past reference conversations to probabilistically model multi-turn conversations in the actor's persona.
scaling model sizes from 117M to 8.3B parameters yields an improvement from 23.14 to 13.14 perplexity on 1.7M held out Reddit conversations.
We find that conditionally modeling past conversations improves perplexity by 0.47 in automatic evaluations.
arXiv Detail & Related papers (2020-05-13T01:56:00Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.