DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial
Issues
- URL: http://arxiv.org/abs/2310.18130v2
- Date: Tue, 7 Nov 2023 20:29:53 GMT
- Title: DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial
Issues
- Authors: David Q. Sun, Artem Abzaliev, Hadas Kotek, Zidi Xiu, Christopher
Klein, Jason D. Williams
- Abstract summary: Controversy is a reflection of our zeitgeist, and an important aspect of any discourse.
The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to their various questions.
We propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs dataset.
- Score: 3.497021928281132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controversy is a reflection of our zeitgeist, and an important aspect of any
discourse. The rise of large language models (LLMs) as conversational systems
has increased public reliance on these systems for answers to their various
questions. Consequently, it is crucial to systematically examine how these
models respond to questions pertaining to ongoing debates. However, few
datasets exist that provide human-annotated labels reflecting contemporary
discussions. To foster research in this area, we propose a novel
construction of a controversial questions dataset, expanding upon the publicly
released Quora Question Pairs dataset. This dataset presents challenges
concerning knowledge recency, safety, fairness, and bias. We evaluate different
LLMs using a subset of this dataset, illuminating how they handle controversial
issues and the stances they adopt. This research ultimately contributes to our
understanding of LLMs' interaction with controversial issues, paving the way
for improvements in their comprehension and handling of complex societal
debates.
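To make the evaluation setup concrete, below is a minimal sketch, in Python, of the kind of stance-probing loop the abstract describes: ask an LLM a batch of controversial questions and tally the stances it adopts. The three-way stance labels, the keyword heuristic, and the stubbed model call are all illustrative assumptions; the paper itself relies on human-annotated labels, not this heuristic.

```python
from collections import Counter
from typing import Callable

def classify_stance(answer: str) -> str:
    """Crude keyword heuristic standing in for the human annotation the
    paper relies on; the three labels are an assumed taxonomy."""
    text = answer.lower()
    if any(w in text for w in ("i cannot", "as an ai", "both sides")):
        return "neutral/refusal"
    if any(w in text for w in ("yes,", "i agree", "should")):
        return "supportive"
    return "opposed"

def evaluate(model: Callable[[str], str], questions: list[str]) -> Counter:
    """Ask the model each controversial question and tally its stances."""
    return Counter(classify_stance(model(q)) for q in questions)

if __name__ == "__main__":
    # Stub standing in for a real LLM call.
    stub = lambda q: "As an AI, I can see reasonable arguments on both sides."
    sample = ["Should voting be mandatory?",
              "Is a four-day work week better for society?"]
    print(evaluate(stub, sample))  # -> Counter({'neutral/refusal': 2})
```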
Related papers
- ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection [7.152439554068969]
This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025.
We propose a two-step framework involving two small-scale open-source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones (a minimal sketch of this generate-then-select pattern appears after this list).
arXiv Detail & Related papers (2025-06-17T10:10:51Z)
- Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater.
We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios.
Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
- Argumentative Experience: Reducing Confirmation Bias on Controversial Issues through LLM-Generated Multi-Persona Debates [7.4355162723392585]
Large language models (LLMs) are enabling designers to give life to exciting new user experiences for information access.
Our mixed-methods, within-subjects study exposes participants to multiple viewpoints on controversial issues.
Compared to a baseline search system, we see more creative interactions and diverse information-seeking with our multi-persona debate system.
arXiv Detail & Related papers (2024-12-05T21:51:05Z)
- Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective [5.769786334333616]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) applications, including automated text generation and question answering.
They face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses.
This paper discusses these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations.
arXiv Detail & Related papers (2024-11-21T16:09:05Z)
- NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews [65.35458530702442]
We focus on journalistic interviews, a domain rich in grounding communication and abundant in data.
We curate a dataset of 40,000 two-person informational interviews from NPR and CNN.
LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions.
arXiv Detail & Related papers (2024-11-21T01:37:38Z)
- BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation [34.650355693901034]
We study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems.
Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
- Federated Large Language Models: Current Progress and Future Directions [63.68614548512534]
This paper surveys federated learning for LLMs (FedLLM), highlighting recent advances and future directions.
We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges.
arXiv Detail & Related papers (2024-09-24T04:14:33Z)
- DebateQA: Evaluating Question Answering on Debatable Knowledge [13.199937786970027]
We introduce DebateQA, a dataset of 2,941 debatable questions.
We develop two metrics: Perspective Diversity and Dispute Awareness.
Using DebateQA and these two metrics, we assess 12 popular large language models.
arXiv Detail & Related papers (2024-08-02T17:54:34Z)
- LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities.
If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information.
To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
- Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
- Qsnail: A Questionnaire Dataset for Sequential Question Generation [76.616068047362]
We present the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires.
We conduct experiments on Qsnail, and the results reveal that questionnaires produced by retrieval models and traditional generative models do not fully align with the given research topic and intents.
Despite enhancements through the chain-of-thought prompt and finetuning, questionnaires generated by language models still fall short of human-written questionnaires.
arXiv Detail & Related papers (2024-02-22T04:14:10Z)
- What Evidence Do Language Models Find Convincing? [94.90663008214918]
We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts.
We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions.
Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
arXiv Detail & Related papers (2024-02-19T02:15:34Z)
- Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements [30.970994382186944]
We improve the controllability of LLMs in generating statements that support an argument defined by the user in the prompt.
We develop a novel debate-and-tuning pipeline that finetunes LLMs to generate statements obtained via debate.
arXiv Detail & Related papers (2024-02-16T12:00:34Z)
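As noted in the ELLIS Alicante entry above, its two-step generate-then-select framework reduces to a simple control flow. The sketch below is a hypothetical illustration of that pattern only: the function names, the word-overlap relevance score, and the stubbed models are assumptions, not the authors' implementation.

```python
from typing import Callable

def generate_candidates(questioner: Callable[[str], list[str]],
                        argument: str, n: int = 5) -> list[str]:
    """Step 1: the Questioner proposes up to n candidate critical questions."""
    return questioner(argument)[:n]

def select_best(judge: Callable[[str, str], float], argument: str,
                candidates: list[str], k: int = 3) -> list[str]:
    """Step 2: the Judge scores each candidate's relevance to the argument
    and keeps the top k."""
    return sorted(candidates, key=lambda q: judge(argument, q), reverse=True)[:k]

if __name__ == "__main__":
    arg = "Remote work increases productivity."
    # Stubs standing in for the two small-scale open-source LLMs.
    questioner = lambda a: [
        "What evidence supports this productivity claim?",
        "Does the effect hold across industries and roles?",
        "How is productivity being measured?",
    ]
    # Placeholder relevance score: naive word overlap with the argument.
    judge = lambda a, q: float(len(set(a.lower().split()) & set(q.lower().split())))
    print(select_best(judge, arg, generate_candidates(questioner, arg), k=2))
```

In the actual system both steps would be calls to language models; the sketch shows only the orchestration between them.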