Related papers: What Evidence Do Language Models Find Convincing?

What Evidence Do Language Models Find Convincing?

URL: http://arxiv.org/abs/2402.11782v1
Date: Mon, 19 Feb 2024 02:15:34 GMT
Title: What Evidence Do Language Models Find Convincing?
Authors: Alexander Wan, Eric Wallace, Dan Klein
Abstract summary: We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts. We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
Score: 103.67867531892988
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.

Related papers

Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts [29.95198868148809]
We propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet.<n>We provide an in-depth error analysis of the effect of media popularity and region on model performance.
arXiv Detail & Related papers (2025-06-14T15:49:20Z)
Search Arena: Analyzing Search-Augmented LLMs [61.28673331156436]
We introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions.<n>The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes.<n>Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims.
arXiv Detail & Related papers (2025-06-05T17:59:26Z)
Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information [0.0]
The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking. We use AI auditing methodology that systematically evaluates performance of five LLMs. The results indicate that models are better at identifying false statements, especially on sensitive topics.
arXiv Detail & Related papers (2025-03-11T13:06:40Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models to evaluate the thematic alignment of summaries generated by other LLMs. Our findings reveal that while LLM-as-judge offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z)
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation [34.650355693901034]
We study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks. This study focuses on the topic of LLMs assist NLP Researchers. To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs) In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
Investigating Annotator Bias in Large Language Models for Hate Speech Detection [5.589665886212444]
This paper delves into the biases present in Large Language Models (LLMs) when annotating hate speech data. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. We introduce our custom hate speech detection dataset, HateBiasNet, to conduct this research.
arXiv Detail & Related papers (2024-06-17T00:18:31Z)
Can't say cant? Measuring and Reasoning of Dark Jargons in Large Language Models [10.666290735480821]
This paper introduces a domain-specific Cant dataset and CantCounter evaluation framework. Experiments reveal LLMs are susceptible to cant bypassing filters. Updated models exhibit higher acceptance rates for cant queries.
arXiv Detail & Related papers (2024-04-25T17:25:53Z)
Customizing Language Model Responses with Contrastive In-Context Learning [7.342346948935483]
We propose an approach that uses contrastive examples to better describe our intent. This involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want LLMs to avoid. Before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. This reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generting a better answer.
arXiv Detail & Related papers (2024-01-30T19:13:12Z)
You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all. We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs) We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.