LLMs may Dominate Information Access: Neural Retrievers are Biased
Towards LLM-Generated Texts
- URL: http://arxiv.org/abs/2310.20501v2
- Date: Sun, 14 Jan 2024 14:41:06 GMT
- Title: LLMs may Dominate Information Access: Neural Retrievers are Biased
Towards LLM-Generated Texts
- Authors: Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu,
Xiao Zhang, Gang Wang and Jun Xu
- Abstract summary: Large language models (LLMs) have revolutionized the paradigm of information retrieval (IR) applications.
Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher.
To mitigate the source bias, we also propose a plug-and-play debiased constraint for the optimization objective.
- Score: 36.73455759259717
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, the emergence of large language models (LLMs) has revolutionized
the paradigm of information retrieval (IR) applications, especially in web
search. With their remarkable capabilities in generating human-like texts, LLMs
have created enormous volumes of text on the Internet. As a result, IR systems
in the LLM era face a new challenge: the indexed documents are now not only
written by humans but also generated automatically by LLMs. How these
LLM-generated documents influence the IR systems is a pressing and still
unexplored question. In this work, we conduct a quantitative evaluation of
different IR models in scenarios where both human-written and LLM-generated
texts are involved. Surprisingly, our findings indicate that neural retrieval
models tend to rank LLM-generated documents higher. We refer to this category
of biases in neural retrieval models towards the LLM-generated text as the
**source bias**. Moreover, we discover that this bias is not confined to
the first-stage neural retrievers, but extends to the second-stage neural
re-rankers. Then, we provide an in-depth analysis from the perspective of text
compression and observe that neural models can better understand the semantic
information of LLM-generated text, which is further substantiated by our
theoretical analysis. To mitigate the source bias, we also propose a
plug-and-play debiased constraint for the optimization objective, and
experimental results demonstrate its effectiveness. Finally, we discuss the
potentially severe concerns stemming from the observed source bias and hope our findings
can serve as a critical wake-up call to the IR community and beyond. To
facilitate future explorations of IR in the LLM era, the two newly constructed
benchmarks and code will be made available at
https://github.com/KID-22/LLM4IR-Bias.
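The abstract describes the plug-and-play debiased constraint only at a high level. As a rough illustration (not the paper's exact formulation), one way to realize such a constraint is to add a penalty to a standard pairwise ranking loss that fires only when the LLM-generated copy of a document outscores its equally relevant human-written counterpart. The function names and the hinge-style penalty below are illustrative assumptions:

```python
import math


def pairwise_ranking_loss(pos_score: float, neg_score: float) -> float:
    """Standard softmax cross-entropy over a (relevant, irrelevant) score pair."""
    return -math.log(
        math.exp(pos_score) / (math.exp(pos_score) + math.exp(neg_score))
    )


def debiased_ranking_loss(pos_score: float, neg_score: float,
                          human_score: float, llm_score: float,
                          lam: float = 0.1) -> float:
    """Ranking loss plus a hinge penalty that activates only when the
    LLM-generated version of a document outscores the human-written
    version of the same content, i.e. only when source bias appears."""
    bias_penalty = max(0.0, llm_score - human_score)
    return pairwise_ranking_loss(pos_score, neg_score) + lam * bias_penalty
```

Because the penalty is zero whenever the human-written copy scores at least as high, the extra term leaves an unbiased retriever's objective untouched, which is what makes such a constraint "plug-and-play".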
Related papers
- ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Large language models (LLMs) generate human-preferable texts.
We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z)
- Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z)
- Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
- Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
- A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions [39.36381851190369]
There is an imperative need to develop detectors that can detect LLM-generated text.
This is crucial to mitigate potential misuse of LLMs and safeguard realms like artistic expression and social networks from harmful influence of LLM-generated content.
Detection techniques have advanced notably in recent years, propelled by innovations in watermarking, statistics-based detectors, neural-based detectors, and human-assisted methods.
arXiv Detail & Related papers (2023-10-23T09:01:13Z)
- Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection [22.658378054986624]
Large language models (LLMs) have shown remarkable performance in various tasks.
LLMs provide desirable multi-perspective rationales but still underperform the basic SLM, a fine-tuned BERT.
We propose that current LLMs may not be a substitute for fine-tuned SLMs in fake news detection but can serve as a good advisor for SLMs.
arXiv Detail & Related papers (2023-09-21T16:47:30Z)
- Neural Authorship Attribution: Stylometric Analysis on Large Language Models [16.63955074133222]
Large language models (LLMs) such as GPT-4, PaLM, and Llama have significantly propelled the generation of AI-crafted text.
With rising concerns about their potential misuse, there is a pressing need for AI-generated-text forensics.
arXiv Detail & Related papers (2023-08-14T17:46:52Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.