Summarizing long regulatory documents with a multi-step pipeline
- URL: http://arxiv.org/abs/2408.09777v2
- Date: Mon, 14 Oct 2024 12:12:19 GMT
- Title: Summarizing long regulatory documents with a multi-step pipeline
- Authors: Mika Sie, Ruby Beek, Michiel Bots, Sjaak Brinkkemper, Albert Gatt,
- Abstract summary: We show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies depending on the model used.
For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance.
- Score: 2.2591852560804675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to their length and complexity, long regulatory texts are challenging to summarize. To address this, a multi-step extractive-abstractive architecture is proposed to handle lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics rank general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.
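As a rough illustration of the extract-then-abstract idea described above, the sketch below pairs a simple TF-IDF-based extractive step with an off-the-shelf BART summarizer. The scoring method, model choice, and keep ratio are assumptions for illustration only, not the configurations evaluated in the paper.

```python
# Hedged sketch of a two-step extractive-abstractive pipeline.
# The extractive scorer (TF-IDF centrality) and the abstractive model
# (facebook/bart-large-cnn) are illustrative stand-ins, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

abstractive = pipeline("summarization", model="facebook/bart-large-cnn")

def extractive_step(sentences, keep_ratio=0.3):
    """Score sentences by TF-IDF centrality and keep the top fraction, in document order."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    centrality = cosine_similarity(tfidf).mean(axis=1)  # how representative each sentence is
    k = max(1, int(len(sentences) * keep_ratio))
    top = sorted(sorted(range(len(sentences)), key=lambda i: -centrality[i])[:k])
    return [sentences[i] for i in top]

def two_step_summary(sentences, max_len=200):
    """Extractive selection first, then abstractive rewriting of the reduced text."""
    reduced = " ".join(extractive_step(sentences))
    return abstractive(reduced, max_length=max_len, min_length=40, truncation=True)[0]["summary_text"]

# Usage: pass the regulatory document split into sentences,
# e.g. summary = two_step_summary(nltk.sent_tokenize(long_regulation_text))
```

The extractive step shrinks the input to fit the abstractive model's context window, which is why the paper finds it helps short-context models but can hurt long-context ones.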
Related papers
- Assessment of Transformer-Based Encoder-Decoder Model for Human-Like Summarization [0.05852077003870416]
This work leverages the transformer-based BART model for human-like summarization.
After training and fine-tuning, the encoder-decoder model is tested on a diverse set of sample articles.
The fine-tuned model's performance is compared with that of the baseline pretrained model.
Empirical results on BBC News articles show that the human-written gold-standard summaries are 17% more factually consistent than the model-generated summaries.
arXiv Detail & Related papers (2024-10-22T09:25:04Z)
- A Novel LLM-based Two-stage Summarization Approach for Long Dialogues [9.835499880812646]
This study proposes a hierarchical framework that segments and condenses information from long documents.
The condensation stage utilizes an unsupervised generation model to generate condensed data.
The summarization stage fine-tunes the abstractive summarization model on the condensed data to generate the final results.
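A minimal sketch of this segment-then-condense pattern, assuming naive fixed-size chunking and a generic BART checkpoint as a stand-in for the paper's condensation model and fine-tuned summarizer:

```python
# Illustrative two-stage sketch (assumed components, not the paper's implementation):
# chunk the long document, condense each chunk, then summarize the condensed text.
from transformers import pipeline

condenser = pipeline("summarization", model="facebook/bart-large-cnn")  # stand-in condensation model
summarizer = condenser  # in the paper, this stage is fine-tuned on the condensed data

def chunk(text, size=3000):
    """Naive fixed-size character chunking; the paper segments more carefully."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def two_stage_summary(long_text):
    # Stage 1: condense each segment independently.
    condensed = " ".join(
        condenser(c, max_length=128, min_length=30, truncation=True)[0]["summary_text"]
        for c in chunk(long_text)
    )
    # Stage 2: summarize the concatenated condensed text.
    return summarizer(condensed, max_length=200, min_length=50, truncation=True)[0]["summary_text"]
```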
arXiv Detail & Related papers (2024-10-09T03:42:40Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
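For illustration only, a zero-shot attempt at the task with an instruction-tuned seq2seq model; the prompt and model choice are assumptions rather than the paper's evaluated setup.

```python
# Hypothetical zero-shot Split-and-Rephrase sketch (assumed prompt and model).
from transformers import pipeline

sprp = pipeline("text2text-generation", model="google/flan-t5-base")

complex_sentence = (
    "The regulation, which entered into force in 2018 and applies to all member states, "
    "requires controllers to document every processing activity."
)
prompt = "Split the following sentence into several short, grammatical sentences: " + complex_sentence
print(sprp(prompt, max_new_tokens=96)[0]["generated_text"])
```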
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers [20.10172411803626]
We propose a compositional soft attention architecture that applies RoBERTa sentence-wise to extract plausible rationales at the token-level.
We find this method to significantly outperform Longformer-driven baselines on sentiment classification datasets.
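A simplified sketch of the sentence-wise encoding plus soft attention idea, with assumed components and untrained heads, and with sentence-level rather than token-level weights; it is not the authors' implementation.

```python
# Simplified sketch: encode each sentence with RoBERTa, attend softly over the
# sentence vectors, and reuse the attention weights as sentence-level rationales.
# The attention and classification heads are untrained placeholders here.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
attn_scorer = torch.nn.Linear(encoder.config.hidden_size, 1)  # soft attention head
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)   # e.g. binary sentiment

def classify_with_rationales(sentences):
    # Sentence-wise application of RoBERTa: one forward pass per batch of sentences.
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[:, 0]        # one vector per sentence (<s> position)
    weights = torch.softmax(attn_scorer(hidden).squeeze(-1), dim=0)  # soft attention over sentences
    doc_vec = (weights.unsqueeze(-1) * hidden).sum(dim=0)      # attention-pooled document vector
    return classifier(doc_vec), weights                        # high-weight sentences act as rationales
```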
arXiv Detail & Related papers (2023-03-14T15:45:35Z)
- Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators.
It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus.
We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
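A toy sketch of this coarse-to-fine idea with a generic causal LM and hand-written prompts; the paper's staged vocabularies and per-stage fine-tuning are not reproduced here.

```python
# Toy coarse-to-fine generation sketch (assumed prompts and stand-in model;
# the paper fine-tunes a separate model per refinement stage).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def progressive_generate(topic):
    # Stage 1: produce a coarse skeleton (salient keywords / outline).
    skeleton = generator(f"Keywords for a story about {topic}:",
                         max_new_tokens=40, do_sample=True)[0]["generated_text"]
    # Stage 2: expand the skeleton into a full passage, conditioning on the coarse stage.
    return generator(skeleton + "\n\nFull story:\n",
                     max_new_tokens=200, do_sample=True)[0]["generated_text"]
```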
arXiv Detail & Related papers (2020-06-28T21:23:05Z)