WikiSplit++: Easy Data Refinement for Split and Rephrase
- URL: http://arxiv.org/abs/2404.09002v1
- Date: Sat, 13 Apr 2024 13:07:32 GMT
- Title: WikiSplit++: Easy Data Refinement for Split and Rephrase
- Authors: Hayato Tsukagoshi, Tsutomu Hirao, Makoto Morishita, Katsuki Chousa, Ryohei Sasano, Koichi Takeda
- Abstract summary: Split and Rephrase splits a complex sentence into multiple simple sentences with the same meaning.
We create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences.
Our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
- Score: 19.12982606032723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
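To make the refinement step concrete, here is a minimal sketch of the filtering-and-reversing pipeline, assuming a generic off-the-shelf NLI model from Hugging Face; the model name (`roberta-large-mnli`) and the keep-only-if-all-entailed rule are illustrative assumptions, not necessarily the authors' exact configuration.

```python
# Minimal sketch of the WikiSplit++ refinement idea (illustrative, not the
# authors' exact setup): keep an instance only if the complex sentence
# entails every reference simple sentence, then reverse the reference order.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    # The complex sentence is the premise; a simple sentence is the hypothesis.
    pred = nli({"text": premise, "text_pair": hypothesis})[0]
    return pred["label"] == "ENTAILMENT"

def refine(dataset):
    refined = []
    for complex_sent, simple_sents in dataset:
        # Drop the instance if any simple sentence is not entailed by the
        # complex sentence; such sentences are likely hallucinations.
        if all(entails(complex_sent, s) for s in simple_sents):
            # Second refinement step from the abstract: reverse the
            # order of the reference simple sentences.
            refined.append((complex_sent, simple_sents[::-1]))
    return refined
```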
Related papers
- Context-Aware Hierarchical Merging for Long Document Summarization [56.96619074316232]
We propose different approaches to enrich hierarchical merging with context from the source document.
Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines.
arXiv Detail & Related papers (2025-02-03T01:14:31Z)
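As a rough illustration of the idea (not the paper's implementation), the sketch below merges chunk summaries pairwise and enriches each merge with passages pulled from the source document; `llm_summarize` and `retrieve_context` are hypothetical callables standing in for a summarization model and a retriever.

```python
# Hedged sketch of context-aware hierarchical merging; `llm_summarize` and
# `retrieve_context` are hypothetical stand-ins for a summarizer and retriever.
def hierarchical_merge(chunks, llm_summarize, retrieve_context):
    summaries = [llm_summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        merged = []
        for i in range(0, len(summaries), 2):
            pair = summaries[i:i + 2]
            # Contextual augmentation: attach source passages relevant to
            # the summaries being merged before summarizing them together.
            context = "\n".join(retrieve_context(s) for s in pair)
            merged.append(llm_summarize("\n".join(pair) + "\n" + context))
        summaries = merged
    return summaries[0]
```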
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
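The windowing idea can be sketched as follows, with zlib standing in for the paper's neural compressor and a 256-bit budget chosen arbitrarily: each window grows until its compressed size reaches the budget, so every block carries roughly the same amount of information.

```python
# Illustrative sketch of Equal-Info Windows; zlib stands in for the paper's
# neural (LM-based) compressor, and the bit budget is an arbitrary choice.
import zlib

def equal_info_windows(text: str, budget_bits: int = 256):
    windows, start = [], 0
    for end in range(1, len(text) + 1):
        # Grow the current window until its compressed size hits the budget.
        bits = 8 * len(zlib.compress(text[start:end].encode("utf-8")))
        if bits >= budget_bits:
            windows.append(text[start:end])
            start = end
    if start < len(text):
        windows.append(text[start:])  # trailing, under-budget window
    return windows
```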
- Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to split and rephrase a compound and complex sentence into a set of simplified sentences.
The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity.
This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM 2023 conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z)
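A heavily simplified sketch of the dependency-based splitting intuition, using spaCy as an assumed toolkit (the paper defines its own rules and also measures syntactic complexity): break a compound sentence at coordinating conjunctions attached to verbs.

```python
# Heavily simplified sketch of dependency-based splitting with spaCy;
# the paper's actual rules and complexity measure are richer than this.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_compound(sentence: str):
    doc = nlp(sentence)
    clauses, current = [], []
    for tok in doc:
        # A coordinating conjunction attached to a verb marks a clause boundary.
        if tok.dep_ == "cc" and tok.head.pos_ in ("VERB", "AUX"):
            if current:
                clauses.append(" ".join(current))
            current = []
        else:
            current.append(tok.text)
    if current:
        clauses.append(" ".join(current))
    return clauses

# split_compound("John cooked dinner and Mary washed the dishes.")
# -> ["John cooked dinner", "Mary washed the dishes ."]
```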
- Benchmarking Long-tail Generalization with Likelihood Splits [20.47194488430863]
We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets.
We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model are placed in the test set, and more likely examples are in the training set.
arXiv Detail & Related papers (2022-10-13T07:27:14Z)
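The re-splitting recipe can be sketched directly, with GPT-2 and a 20% test fraction as illustrative stand-ins for the paper's actual scoring model and split sizes.

```python
# Sketch of a Likelihood Split: score each example with a pre-trained LM
# and send the least likely examples to the test set. GPT-2 and the 20%
# test fraction are illustrative choices, not the paper's exact setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels=input_ids, the model returns the mean token NLL as .loss.
    return lm(ids, labels=ids).loss.item()

def likelihood_split(examples, test_fraction=0.2):
    ranked = sorted(examples, key=avg_nll)  # most likely (lowest NLL) first
    cut = int(len(ranked) * (1 - test_fraction))
    return ranked[:cut], ranked[cut:]  # (train, test): tail is least likely
```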
- BiSECT: Learning to Split and Rephrase Sentences with Bitexts [25.385804867037937]
We introduce a novel dataset and a new model for this 'split and rephrase' task.
BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences.
We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited.
arXiv Detail & Related papers (2021-09-10T17:30:14Z)
- ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences [7.639576741566091]
We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source.
Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies.
We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition.
arXiv Detail & Related papers (2021-06-22T19:31:28Z)
- Three Sentences Are All You Need: Local Path Enhanced Document Relation Extraction [54.95848026576076]
We present an embarrassingly simple but effective method to select evidence sentences for document-level RE.
We have released our code at https://github.com/AndrewZhe/Three-Sentences-Are-All-You-Need.
arXiv Detail & Related papers (2021-06-03T12:29:40Z)
- Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
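The matching formulation above can be sketched by embedding the document and each candidate extract and choosing the candidate closest to the document; the SentenceTransformer encoder and the fixed two-sentence candidates are illustrative assumptions, not the paper's architecture.

```python
# Sketch of extractive summarization as semantic text matching: pick the
# candidate extract whose embedding is closest to the whole document's.
# The encoder and two-sentence candidates are illustrative assumptions.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def best_extract(sentences, k=2):
    doc_emb = encoder.encode(" ".join(sentences), convert_to_tensor=True)
    candidates = [" ".join(c) for c in combinations(sentences, k)]
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(doc_emb, cand_embs)[0]  # cosine similarity per candidate
    return candidates[int(scores.argmax())]
```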
- Fact-aware Sentence Split and Rephrase with Permutation Invariant Training [93.66323661321113]
Sentence Split and Rephrase aims to break down a complex sentence into several simple sentences with its meaning preserved.
Previous studies tend to address the issue by seq2seq learning from parallel sentence pairs.
We introduce Permutation Training to verify the effects of order variance in seq2seq learning for this task.
arXiv Detail & Related papers (2020-01-16T07:30:19Z)
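A minimal sketch of the permutation-invariant objective, assuming a generic `seq2seq_loss(source, target)` criterion: score the model against every ordering of the reference simple sentences and train on the minimum, so a valid output in a different order is not penalized.

```python
# Minimal sketch of permutation-invariant training for Split and Rephrase;
# `seq2seq_loss` is a stand-in for any encoder-decoder training criterion.
from itertools import permutations

def permutation_invariant_loss(seq2seq_loss, source, simple_sents):
    # Take the loss under the best-matching ordering of the references so
    # that order variance in the targets does not dominate training.
    return min(
        seq2seq_loss(source, " ".join(order))
        for order in permutations(simple_sents)
    )
```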
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.