Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
- URL: http://arxiv.org/abs/2503.08684v1
- Date: Tue, 11 Mar 2025 17:59:00 GMT
- Title: Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
- Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
- Abstract summary: We propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness.
- Score: 64.43980129731587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap.
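The abstract's diagnose-then-correct procedure can be sketched as follows. This is a minimal illustrative simplification assuming a linear perplexity-bias model; the function names and the least-squares diagnosis step are assumptions for exposition, not the authors' implementation (see the linked repository for the real code):

```python
import numpy as np

def diagnose_perplexity_bias(scores, perplexities):
    """Diagnose step (sketch): least-squares estimate of the slope and
    intercept linking document perplexity to the raw relevance score."""
    ppl = np.asarray(perplexities, dtype=float)
    s = np.asarray(scores, dtype=float)
    design = np.stack([ppl, np.ones_like(ppl)], axis=1)
    coeffs, _, _, _ = np.linalg.lstsq(design, s, rcond=None)
    slope, intercept = coeffs
    return slope, intercept

def correct_relevance(scores, perplexities, slope):
    """Correct step (sketch): subtract the diagnosed perplexity effect,
    leaving a debiased relevance estimate for ranking."""
    return np.asarray(scores, dtype=float) - slope * np.asarray(perplexities, dtype=float)
```

Under this toy model, documents are re-ranked by the corrected score, so a document can no longer gain rank purely by having low perplexity.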
Related papers
- Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery [10.573861741540853]
We introduce a novel approach that integrates Knowledge Graphs (KGs) with Large Language Models (LLMs) to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. Our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs.
arXiv Detail & Related papers (2025-06-10T13:13:55Z)
- Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? How severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z)
- dcFCI: Robust Causal Discovery Under Latent Confounding, Unfaithfulness, and Mixed Data [1.9797215742507548]
We introduce the first nonparametric score to assess a Partial Ancestral Graph's compatibility with observed data. We then propose data-compatible Fast Causal Inference (dcFCI) to jointly address latent confounding, empirical unfaithfulness, and mixed data types.
arXiv Detail & Related papers (2025-05-10T07:05:19Z)
- Causally Fair Node Classification on Non-IID Graph Data [9.363036392218435]
This paper addresses a prevalent challenge in fairness-aware ML algorithms: the overlooked domain of non-IID, graph-based settings. We develop the Message Passing Variational Autoencoder for Causal Inference.
arXiv Detail & Related papers (2025-05-03T02:05:51Z)
- Deep evolving semi-supervised anomaly detection [14.027613461156864]
The aim of this paper is to formalise the task of continual semi-supervised anomaly detection (CSAD). The paper introduces a baseline model of a variational autoencoder (VAE) to work with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection.
arXiv Detail & Related papers (2024-12-01T15:48:37Z)
- Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation [0.0]
This study explores the capability of Large Language Models to evaluate causality in causal graphs.
Our study compares two approaches: (1) a prompting-based method for zero-shot and few-shot causal inference, and (2) fine-tuning language models for the causal relation prediction task.
arXiv Detail & Related papers (2024-05-29T09:06:18Z)
- Discovery of the Hidden World with Large Language Models [95.58823685009727]
This paper presents Causal representatiOn AssistanT (COAT) that introduces large language models (LLMs) to bridge the gap.
LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data.
COAT also adopts causal discovery methods (CDs) to find causal relations among the identified variables, as well as to provide feedback to LLMs to iteratively refine the proposed factors.
arXiv Detail & Related papers (2024-02-06T12:18:54Z)
- Predicting Scientific Impact Through Diffusion, Conformity, and Contribution Disentanglement [11.684776349325887]
Existing models typically rely on static graphs for citation count estimation.
We introduce a novel model, DPPDCC, which Disentangles the Potential impacts of Papers into Diffusion, Conformity, and Contribution values.
arXiv Detail & Related papers (2023-11-15T07:21:11Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- Inducing Causal Structure for Abstractive Text Summarization [76.1000380429553]
We introduce a Structural Causal Model (SCM) to induce the underlying causal structure of the summarization data.
We propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) to learn the causal representations that can mimic the causal factors.
Experimental results on two widely used text summarization datasets demonstrate the advantages of our approach.
arXiv Detail & Related papers (2023-08-24T16:06:36Z)
- Biases in Inverse Ising Estimates of Near-Critical Behaviour [0.0]
Inverse inference allows pairwise interactions to be reconstructed from empirical correlations.
We show that estimators used for this inference, such as pseudo-likelihood maximization (PLM), are biased.
Data-driven methods are explored and applied to a functional magnetic resonance imaging (fMRI) dataset from neuroscience.
arXiv Detail & Related papers (2023-01-13T14:01:43Z)
- On Causality in Domain Adaptation and Semi-Supervised Learning: an Information-Theoretic Analysis for Parametric Models [40.97750409326622]
We study the learning performance of prediction in the target domain from an information-theoretic perspective.
We show that in causal learning, the excess risk depends on the size of the source sample at a rate of $O(\frac{1}{m})$ only if the labelling distribution between the source and target domains remains unchanged.
In anti-causal learning, we show that the unlabelled data dominate the performance at a rate of typically $O(\frac{1}{n})$.
arXiv Detail & Related papers (2022-05-10T03:18:48Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.