Topical Change Detection in Documents via Embeddings of Long Sequences
- URL: http://arxiv.org/abs/2012.03619v1
- Date: Mon, 7 Dec 2020 12:09:37 GMT
- Title: Topical Change Detection in Documents via Embeddings of Long Sequences
- Authors: Dennis Aumiller, Satya Almasian, Sebastian Lackner and Michael Gertz
- Abstract summary: We formulate the task of text segmentation as an independent supervised prediction task.
By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information.
Unlike previous approaches, which mostly operate at the sentence level, we consistently use a broader context.
- Score: 4.13878392637062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a longer document, the topic often slightly shifts from one passage to the
next, where topic boundaries are usually indicated by semantically coherent
segments. Discovering this latent structure in a document improves the
readability and is essential for passage retrieval and summarization tasks. We
formulate the task of text segmentation as an independent supervised prediction
task, making it suitable to train on Transformer-based language models. By
fine-tuning on paragraphs of similar sections, we are able to show that learned
features encode topic information, which can be used to find the section
boundaries and divide the text into coherent segments. Unlike previous
approaches, which mostly operate at the sentence level, we consistently use the
broader context of an entire paragraph and assume topical independence of
preceding and succeeding text. Lastly, we introduce a novel large-scale dataset
constructed from online Terms-of-Service documents, on which we compare against
various traditional and deep learning baselines, showing significantly better
performance of Transformer-based methods.
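The paragraph-pair formulation can be made concrete with a short sketch. The following is a minimal illustration, assuming the HuggingFace `transformers` library; the encoder name, label convention, and threshold are assumptions rather than the authors' released code, and the classification head would first need fine-tuning on paragraph pairs labeled as same-section vs. boundary.
```python
# Minimal sketch of the paragraph-pair formulation, assuming the HuggingFace
# `transformers` library. The encoder name, label convention (1 = "same
# section"), and threshold are illustrative assumptions, not the authors'
# released code; the head would first be fine-tuned on labeled paragraph pairs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-base"  # assumption: any pretrained encoder could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def same_section_prob(prev_par: str, next_par: str) -> float:
    """Probability that two adjacent paragraphs continue the same topic."""
    inputs = tokenizer(prev_par, next_par, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def segment(paragraphs: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Cut the document wherever the pair score falls below the threshold."""
    segments, current = [], [paragraphs[0]]
    for prev_par, next_par in zip(paragraphs, paragraphs[1:]):
        if same_section_prob(prev_par, next_par) >= threshold:
            current.append(next_par)
        else:
            segments.append(current)
            current = [next_par]
    segments.append(current)
    return segments
```
Scoring only adjacent pairs mirrors the assumed topical independence of preceding and succeeding text: each boundary decision is made locally, without conditioning on earlier predictions.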
Related papers
- Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation [9.703886326323644]
We introduce a new model - Segment any Text (SaT) - for robust, efficient and adaptable sentence segmentation.
To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation.
To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains (see the sketch after this entry).
arXiv Detail & Related papers (2024-06-24T14:36:11Z)
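A conceptual sketch of that punctuation-robust idea, framed as token-level boundary prediction: the encoder name and two-class labeling below are assumptions, the head would need training on boundary-labeled text first, and this is an illustration in the spirit of SaT rather than the authors' released code.
```python
# Sentence segmentation as token classification: each token receives a
# probability of ending a sentence, so the model can split text that lacks
# punctuation or casing. Illustrative sketch only; the untrained head would
# first be fine-tuned on boundary-labeled text.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # label 1 = "token ends a sentence"
model.eval()

def split_sentences(text: str, threshold: float = 0.5) -> list[str]:
    enc = tokenizer(text, return_offsets_mapping=True,
                    truncation=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0, :, 1]
    sentences, start = [], 0
    for (_, end), p in zip(offsets, probs.tolist()):
        if p >= threshold and end > start:  # boundary predicted after this token
            sentences.append(text[start:end].strip())
            start = end
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```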
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model, MiniSeg, which outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Toward Unifying Text Segmentation and Long Document Summarization [31.084738269628748]
We study the role that section segmentation plays in extractive summarization of written and spoken documents.
Our approach learns robust sentence representations by performing summarization and segmentation simultaneously.
Our findings suggest that the model not only achieves state-of-the-art performance on publicly available benchmarks, but also demonstrates better cross-genre transferability.
arXiv Detail & Related papers (2022-10-28T22:07:10Z)
- Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves performance that is competitive with, and sometimes better than, state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
- Revisiting Transformer-based Models for Long Document Classification [31.60414185940218]
In real-world applications, multi-page multi-paragraph documents are common and cannot be efficiently encoded by vanilla Transformer-based models.
We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers.
We observe a clear benefit from being able to process longer text and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks (a sketch follows this entry).
arXiv Detail & Related papers (2022-04-14T00:44:36Z)
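One common way to mitigate that overhead is sparse windowed attention. Below is a minimal sketch, assuming the HuggingFace Longformer implementation; the checkpoint name and two-label setup are illustrative, and a real classifier would be fine-tuned on the target task.
```python
# Long-document classification with windowed self-attention: memory grows
# roughly linearly with length, so 4096 tokens fit where a vanilla BERT
# truncates at 512. Sketch under stated assumptions, not a tuned classifier.
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)
model.eval()

def classify(document: str) -> int:
    inputs = tokenizer(document, truncation=True, max_length=4096,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```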
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in Rouge F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.