CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels
- URL: http://arxiv.org/abs/2412.02819v4
- Date: Tue, 17 Dec 2024 16:03:43 GMT
- Title: CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels
- Authors: Lingxiao Wei, He Yan, Xiangju Lu, Junmin Zhu, Jun Wang, Wei Zhang,
- Abstract summary: CNNSum is a benchmark for long-context summarization based on Chinese novels.
It features four subsets totaling 695 samples, with lengths ranging from 16k to 128k.
We conduct experiments to explore and improve long-context summarization.
- Score: 11.614599448394374
- License:
- Abstract: Large Language Models (LLMs) have been well-researched in various long-context tasks. However, the scarcity of high-quality long-context summarization datasets has hindered further advancements in this area. To address this, we introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels, featuring human-driven annotations, which comprises four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We evaluate numerous LLMs and conduct detailed case analyses. Furthermore, we conduct extensive fine-tuning experiments to explore and improve long-context summarization. In our study: (1) Advanced LLMs like GPT-4o may still generate subjective commentary, leading to vague summaries. (2) Currently, long-context summarization mainly relies on memory ability afforded by longer context lengths. The advantages of Large LLMs are hard to utilize, thus small LLMs are the most cost-effective. (3) Different prompt templates paired with various version models may cause large performance gaps. In further fine-tuning, these can be mitigated, and the Base version models perform better. (4) LLMs with RoPE-base scaled exhibit strong extrapolation potential; using short-context data can significantly improve long-context summarization performance. However, further applying other interpolation methods requires careful selection. (5) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance future research in this field. https://github.com/CxsGhost/CNNSum
Related papers
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs)
Strikingly, many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance.
We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z) - DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA)
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs)
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models [58.54538318912159]
M4LE is a benchmark for evaluating the long-sequence capability of large language models (LLMs)
M4LE is based on a diverse NLP task pool comprising 36 NLP task types and 12 domains.
We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs.
arXiv Detail & Related papers (2023-10-30T03:11:30Z) - BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency over NLP tasks with normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs)
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.