Enabling Large Language Models to Generate Text with Citations
- URL: http://arxiv.org/abs/2305.14627v2
- Date: Tue, 31 Oct 2023 15:04:35 GMT
- Title: Enabling Large Language Models to Generate Text with Citations
- Authors: Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
- Abstract summary: Large language models (LLMs) have emerged as a widely-used tool for information seeking.
Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability.
We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have emerged as a widely-used tool for
information seeking, but their generated outputs are prone to hallucination. In
this work, our aim is to allow LLMs to generate text with citations, improving
their factual correctness and verifiability. Existing work mainly relies on
commercial search engines and human evaluation, making it challenging to
reproduce and compare different modeling approaches. We propose ALCE, the first
benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set
of questions and retrieval corpora and requires building end-to-end systems to
retrieve supporting evidence and generate answers with citations. We develop
automatic metrics along three dimensions -- fluency, correctness, and citation
quality -- and demonstrate their strong correlation with human judgements. Our
experiments with state-of-the-art LLMs and novel prompting strategies show that
current systems have considerable room for improvement. For example, on the
ELI5 dataset, even the best models lack complete citation support 50% of the
time. Our analyses further highlight promising future directions, including
developing better retrievers, advancing long-context LLMs, and improving the
ability to synthesize information from multiple sources.
Related papers
- Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations
We evaluate the factual accuracy and citation performance of state-of-the-art large language models (LLMs) on the task of Question Answering (QA).
Our results show that larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers.
Date: 2024-12-23T23:55:19Z
- Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
Large language models (LLMs) are prone to hallucination and producing factually incorrect information.
We propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search.
Date: 2024-12-19T13:55:48Z
- Advancing Large Language Model Attribution through Self-Improving
We present START, a framework for improving the attribution capability of large language models (LLMs).
START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation.
Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average.
Date: 2024-10-17T07:55:33Z
- On the Capacity of Citation Generation by Large Language Models
Retrieval-augmented generation (RAG) appears as a promising method to alleviate the "hallucination" problem in large language models (LLMs).
Date: 2024-10-15T03:04:26Z
- Citekit: A Modular Toolkit for Large Language Model Citation Generation
Large Language Models (LLMs) generate citations in Question-Answering (QA) tasks.
There is currently no unified framework to standardize and fairly compare different citation generation methods.
We introduce Citekit, an open-source and modular toolkit designed to facilitate the implementation and evaluation of existing citation generation methods.
Date: 2024-08-06T02:13:15Z
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
Date: 2024-05-31T20:15:10Z
- Improving Attributed Text Generation of Large Language Models via Preference Learning
We model the attribution task as preference learning and introduce an Automatic Preference Optimization framework.
APO achieves state-of-the-art citation F1 with higher answer quality.
Date: 2024-03-27T09:19:13Z
- Exploring Precision and Recall to assess the quality and diversity of LLMs
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
Date: 2024-02-16T13:53:26Z
- Supervised Knowledge Makes Large Language Models Better In-context Learners
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
Date: 2023-12-26T07:24:46Z
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
Date: 2023-03-29T17:03:21Z