LongWanjuan: Towards Systematic Measurement for Long Text Quality
- URL: http://arxiv.org/abs/2402.13583v2
- Date: Thu, 22 Feb 2024 03:06:55 GMT
- Title: LongWanjuan: Towards Systematic Measurement for Long Text Quality
- Authors: Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu and Dahua Lin
- Abstract summary: LongWanjuan is a dataset of over 160B tokens, specifically tailored to enhance the training of language models for long-text tasks.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
- Score: 102.46517202896521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of training data is crucial for enhancing the long-text
capabilities of foundation models. Despite existing efforts to refine data
quality through heuristic rules and evaluations based on data diversity and
difficulty, there's a lack of systematic approaches specifically tailored for
assessing long texts. Addressing this gap, our work systematically measures the
quality of long texts by evaluating three fundamental linguistic dimensions:
coherence, cohesion, and complexity. Drawing inspiration from the
aforementioned three dimensions, we introduce a suite of metrics designed to
evaluate the quality of long texts, encompassing both statistical and
pre-trained language model-based ones. Leveraging these metrics, we present
LongWanjuan, a bilingual dataset specifically tailored to enhance the training
of language models for long-text tasks with over 160B tokens. In LongWanjuan,
we categorize long texts into holistic, aggregated, and chaotic types, enabling
a detailed analysis of long-text quality. Furthermore, we devise a data mixture
recipe that strategically balances different types of long texts within
LongWanjuan, leading to significant improvements in model performance on
long-text tasks. The code and dataset are available at
https://github.com/OpenLMLab/LongWanjuan.
Related papers
- LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs [4.4965596747053]
Long-form text generation is critical for applications such as design proposals and creative writing.
A new long-form text evaluation benchmark, LongGenBench, tests models' ability to identify specific events within generated long text sequences.
arXiv Detail & Related papers (2024-09-03T17:25:54Z)
- LongLaMP: A Benchmark for Personalized Long-form Text Generation [87.41296912519992]
We develop the Long-text Language Model Personalization (LongLaMP) Benchmark.
LongLaMP provides a comprehensive and diverse evaluation framework for personalized long-text generation.
The results highlight the importance of personalization across a wide variety of long-text generation tasks.
arXiv Detail & Related papers (2024-06-27T01:52:05Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency over NLP tasks with normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.