An Empirical Survey on Long Document Summarization: Datasets, Models and
Metrics
- URL: http://arxiv.org/abs/2207.00939v1
- Date: Sun, 3 Jul 2022 02:57:22 GMT
- Authors: Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan
- Abstract summary: We provide a comprehensive overview of the research on long document summarization.
We conduct an empirical analysis to broaden the perspective on current research progress.
- Score: 33.655334920298856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long documents such as academic articles and business reports have been the
standard format to detail out important issues and complicated subjects that
require extra attention. An automatic summarization system that can effectively
condense long documents into short and concise texts to encapsulate the most
important information would thus be significant in aiding the reader's
comprehension. Recently, with the advent of neural architectures, significant
research efforts have been made to advance automatic text summarization
systems, and numerous studies on the challenges of extending these systems to
the long document domain have emerged. In this survey, we provide a
comprehensive overview of the research on long document summarization and a
systematic evaluation across the three principal components of its research
setting: benchmark datasets, summarization models, and evaluation metrics. For
each component, we organize the literature within the context of long document
summarization and conduct an empirical analysis to broaden the perspective on
current research progress. The empirical analysis includes a study on the
intrinsic characteristics of benchmark datasets, a multi-dimensional analysis
of summarization models, and a review of the summarization evaluation metrics.
Based on the overall findings, we conclude by proposing possible directions for
future exploration in this rapidly growing field.
Related papers
- SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section [7.366861473623427]
This paper introduces a novel dataset designed for summarizing multiple scientific articles into a section of a survey.
Our contributions are: (1) SurveySum, a new dataset addressing the gap in domain-specific summarization tools; (2) two specific pipelines to summarize scientific articles into a section of a survey; and (3) the evaluation of these pipelines using multiple metrics to compare their performance.
(2024-08-29T11:13:23Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
(2024-02-21T07:27:18Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
(2024-02-20T11:28:50Z)
- QuOTeS: Query-Oriented Technical Summarization [0.2936007114555107]
We propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references.
QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents.
The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete.
(2023-06-20T18:43:24Z)
- Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature [21.440724685950443]
We present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale).
We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets.
(2022-10-18T15:28:30Z)
- Automatic Text Summarization Methods: A Comprehensive Review [1.6114012813668934]
This study provides a detailed analysis of text summarization concepts such as summarization approaches, techniques used, standard datasets, evaluation metrics and future scopes for research.
(2022-03-03T10:45:00Z)
- Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot.
This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
(2021-07-05T16:32:45Z)
- Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents [30.09742243490895]
FacetSum is a faceted summarization benchmark built on Emerald journal articles.
Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries.
We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems.
(2021-05-31T22:58:38Z)
- Summarizing Text on Any Aspects: A Knowledge-Informed Weakly-Supervised Approach [89.56158561087209]
We study summarizing on arbitrary aspects relevant to the document.
Due to the lack of supervision data, we develop a new weak supervision construction method and an aspect modeling scheme.
Experiments show our approach achieves performance boosts on summarizing both real and synthetic documents.
(2020-10-14T03:20:46Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
(2020-08-02T00:09:03Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper focuses on the survey of these new summarization tasks and approaches in the real-world application.
(2020-05-10T14:59:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.