LOT: A Benchmark for Evaluating Chinese Long Text Understanding and
Generation
- URL: http://arxiv.org/abs/2108.12960v1
- Date: Mon, 30 Aug 2021 02:38:32 GMT
- Title: LOT: A Benchmark for Evaluating Chinese Long Text Understanding and
Generation
- Authors: Jian Guan, Zhuoer Feng, Yamei Chen, Ruilin He, Xiaoxi Mao, Changjie
Fan, Minlie Huang
- Abstract summary: Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
- Score: 49.57366550980932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard multi-task benchmarks are essential for driving the progress of
general pretraining models to generalize to various downstream tasks. However,
existing benchmarks such as GLUE and GLGE tend to focus on short text
understanding and generation tasks, without considering long text modeling,
which requires many distinct capabilities such as modeling long-range
commonsense and discourse relations, as well as the coherence and
controllability of generation. The lack of standardized benchmarks makes it
difficult to fully evaluate these capabilities of a model and fairly compare
different models, especially Chinese pretraining models. Therefore, we propose
LOT, a benchmark including two understanding and two generation tasks for
Chinese long text modeling evaluation. We construct the datasets for the tasks
based on various kinds of human-written Chinese stories. Besides, we release an
encoder-decoder Chinese long text pretraining model named LongLM with up to 1
billion parameters. We pretrain LongLM on 120G Chinese novels with two
generative tasks including text infilling and conditional continuation.
Extensive experiments on LOT demonstrate that LongLM matches the performance of
similar-sized pretraining models on the understanding tasks and outperforms
strong baselines substantially on the generation tasks.
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs [4.4965596747053]
Long-form text generation is critical for applications such as design proposals and creative writing.
New long-form text evaluation benchmark, LongGenBench, tests models' ability to identify specific events within generated long text sequences.
arXiv Detail & Related papers (2024-09-03T17:25:54Z) - XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z) - LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z) - BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency over NLP tasks with normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z) - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese)
arXiv Detail & Related papers (2023-08-28T11:53:40Z) - MuLD: The Multitask Long Document Benchmark [4.835289158553091]
We present a new long document benchmark consisting of only documents over 10,000 tokens.
We show that models with increased context length are better able to solve the tasks presented.
arXiv Detail & Related papers (2022-02-15T12:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.