Related papers: LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

URL: http://arxiv.org/abs/2409.02897v3
Date: Tue, 10 Sep 2024 07:43:19 GMT
Title: LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Authors: Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Abstract summary: Long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text. The lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness. We aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability.
Score: 52.30374900597116
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.

Related papers

Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data [67.46386646195818]
We introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
arXiv Detail & Related papers (2025-10-29T06:21:08Z)
NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models [7.134358758293254]
The Needle-in-a-Haystack benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate given contexts made up solely of ten query-relevant sentences. We introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information.
arXiv Detail & Related papers (2025-07-30T06:29:50Z)
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [28.694112253150983]
Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
arXiv Detail & Related papers (2025-05-25T19:58:31Z)
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization [49.37607974207405]
LongPO harnesses short-to-long preference data to transfer short-context capabilities to long-context tasks. LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks.
arXiv Detail & Related papers (2025-02-19T17:59:03Z)
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models [51.90867482317985]
SelfCite is a self-supervised approach to generate fine-grained, sentence-level citations for statements in generated responses. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark.
arXiv Detail & Related papers (2025-02-13T18:55:13Z)
What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering [42.146660039671076]
We develop a retrieve-then-reason framework for large language models (LLMs). We find that modern LLMs struggle to accurately retrieve relevant facts and instead often hallucinate "retrieved facts". We introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure.
arXiv Detail & Related papers (2024-10-04T08:29:12Z)
DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens. We use detective novels as data sources, which naturally have various reasoning elements. We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications. We propose a data mining framework, ProLong, that can assign each training sample a long dependency score. Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs). With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs). Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long context understanding. It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.