Sentence Identification with BOS and EOS Label Combinations
- URL: http://arxiv.org/abs/2301.13352v1
- Date: Tue, 31 Jan 2023 01:03:07 GMT
- Title: Sentence Identification with BOS and EOS Label Combinations
- Authors: Takuma Udagawa, Hiroshi Kanayama, Issei Yoshida
- Abstract summary: We formulate a novel task of sentence identification, where the goal is to identify sentential units (SUs) while excluding non-sentential units (NSUs) in a given text.
We propose a simple yet effective method which combines the beginning of the sentence (BOS) and end of the sentence (EOS) labels to determine the most probable SUs and NSUs.
Our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels.
- Score: 7.053475270377054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sentence is a fundamental unit in many NLP applications. Sentence
segmentation is widely used as the first preprocessing task, where an input
text is split into consecutive sentences considering the end of the sentence
(EOS) as their boundaries. This task formulation relies on a strong assumption
that the input text consists only of sentences, or what we call the sentential
units (SUs). However, real-world texts often contain non-sentential units
(NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which
are unreasonable or undesirable to be treated as a part of an SU. To tackle
this issue, we formulate a novel task of sentence identification, where the
goal is to identify SUs while excluding NSUs in a given text. To conduct
sentence identification, we propose a simple yet effective method which
combines the beginning of the sentence (BOS) and EOS labels to determine the
most probable SUs and NSUs based on dynamic programming. To evaluate this task,
we design an automatic, language-independent procedure to convert the Universal
Dependencies corpora into sentence identification benchmarks. Finally, our
experiments on the sentence identification task demonstrate that our proposed
method generally outperforms sentence segmentation baselines which only utilize
EOS labels.
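The abstract does not spell out the scoring function, so the following is only a minimal sketch of how BOS and EOS labels could be combined with dynamic programming, not the authors' implementation. It assumes hypothetical per-token probabilities `p_bos` and `p_eos` from some tagger, treats every token as either part of an SU span or an NSU token, and scores an SU span as "BOS at its first token, EOS at its last token, and no other BOS or EOS inside". The function name `identify_sentences` and this scoring rule are illustrative assumptions.

```python
import math


def identify_sentences(p_bos, p_eos, eps=1e-9):
    """Return SU spans as (start, end) token indices, end inclusive.

    Assumed scoring (not taken from the paper):
      - an SU span i..j scores log p_bos[i] + log p_eos[j], plus
        log(1 - p) for every other BOS/EOS position inside the span;
      - a token outside any SU scores log(1 - p_bos) + log(1 - p_eos).
    """
    n = len(p_bos)
    if len(p_eos) != n:
        raise ValueError("p_bos and p_eos must have the same length")

    def safe_log(x):
        # Clamp to avoid log(0) for over-confident probabilities.
        return math.log(min(max(x, eps), 1.0 - eps))

    log_b = [safe_log(p) for p in p_bos]
    log_e = [safe_log(p) for p in p_eos]
    log_not_b = [safe_log(1.0 - p) for p in p_bos]
    log_not_e = [safe_log(1.0 - p) for p in p_eos]

    # Prefix sums so that any span can be scored in O(1).
    cum_not_b = [0.0] * (n + 1)
    cum_not_e = [0.0] * (n + 1)
    for k in range(n):
        cum_not_b[k + 1] = cum_not_b[k] + log_not_b[k]
        cum_not_e[k + 1] = cum_not_e[k] + log_not_e[k]

    # best[j] = best log-score for the first j tokens.
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    back = [None] * (n + 1)  # (previous position, ended_with_su)

    for j in range(1, n + 1):
        # Option 1: token j-1 is a non-sentential token.
        best[j] = best[j - 1] + log_not_b[j - 1] + log_not_e[j - 1]
        back[j] = (j - 1, False)
        # Option 2: an SU covers tokens i..j-1.
        for i in range(j):
            inner = (cum_not_b[j] - cum_not_b[i + 1]) + (
                cum_not_e[j - 1] - cum_not_e[i]
            )
            score = best[i] + log_b[i] + log_e[j - 1] + inner
            if score > best[j]:
                best[j] = score
                back[j] = (i, True)

    # Follow back-pointers to recover the SU spans.
    spans = []
    j = n
    while j > 0:
        i, is_su = back[j]
        if is_su:
            spans.append((i, j - 1))
        j = i
    spans.reverse()
    return spans


# Toy input: tokens 0, 1 and 5 look non-sentential, tokens 2..4 form an SU.
p_bos = [0.10, 0.05, 0.90, 0.05, 0.05, 0.10]
p_eos = [0.05, 0.10, 0.05, 0.05, 0.90, 0.10]
print(identify_sentences(p_bos, p_eos))
```

Under these assumptions, tokens that no SU covers are left over as NSUs, whereas a segmentation baseline that only uses EOS labels has to attach every token to some sentence. The double loop is quadratic in the number of tokens; a maximum span length could bound it for long inputs.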
Related papers
- Improving Zero-shot Sentence Decontextualisation with Content Selection and Planning [15.992477600061166]
We propose a framework for zero-shot decontextualisation, which determines what content should be mentioned and in what order for a sentence to be understood out of context.
We identify potentially ambiguous units from the given sentence, and extract relevant units from the context based on their discourse relations.
Finally, we generate a content plan to rewrite the sentence by enriching each ambiguous unit with its relevant units.
arXiv Detail & Related papers (2025-09-22T15:47:07Z) - A Straightforward Pipeline for Targeted Entailment and Contradiction Detection [0.15229257192293197]
A key challenge is to identify which sentences act as premises or contradictions for a specific claim.
We introduce a method that combines the strengths of both approaches for a targeted analysis.
By filtering NLI-identified relationships with attention-based saliency scores, our method efficiently isolates the most significant semantic relationships for any given claim in a text.
arXiv Detail & Related papers (2025-08-23T19:59:24Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy [3.0177210416625124]
Several word-level $\textit{Metric}$ Differential Privacy approaches have been proposed.
We devise a method where composed privatized outputs have higher semantic coherence and variable length.
We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
arXiv Detail & Related papers (2024-06-30T09:37:34Z) - SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation [72.10931780019297]
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design.
We propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH).
Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.
arXiv Detail & Related papers (2023-10-06T03:33:42Z) - IDAS: Intent Discovery with Abstractive Summarization [16.731183915325584]
We show that recent competitive methods in intent discovery can be outperformed by clustering utterances based on abstractive summaries.
We contribute the IDAS approach, which collects a set of descriptive utterance labels by prompting a Large Language Model.
The utterances and their resulting noisy labels are then encoded by a frozen pre-trained encoder, and subsequently clustered to recover the latent intents.
arXiv Detail & Related papers (2023-05-31T12:19:40Z) - RankCSE: Unsupervised Sentence Representations Learning via Learning to
Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
arXiv Detail & Related papers (2023-05-26T08:27:07Z) - PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and
Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually.
We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z) - Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z) - A New Sentence Ordering Method Using BERT Pretrained Model [2.1793134762413433]
We propose a method for sentence ordering which needs neither a training phase nor, consequently, a large corpus for learning.
Our proposed method outperformed other baselines on ROCStories, a corpus of 5-sentence human-made stories.
Other advantages of this method are its interpretability and that it requires no linguistic knowledge.
arXiv Detail & Related papers (2021-08-26T18:47:15Z) - UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z) - Reformulating Sentence Ordering as Conditional Text Generation [17.91448517871621]
We present Reorder-BART (RE-BART), a sentence ordering framework.
We reformulate the task as a conditional text-to-marker generation setup.
Our framework achieves the state-of-the-art performance across six datasets in Perfect Match Ratio (PMR) and Kendall's tau ($\tau$) metric.
arXiv Detail & Related papers (2021-04-14T18:16:47Z) - Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, decide whether there exist any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z) - ReSCo-CC: Unsupervised Identification of Key Disinformation Sentences [3.7405995078130148]
We propose a novel unsupervised task of identifying sentences containing key disinformation within a document that is known to be untrustworthy.
We design a three-phase statistical NLP solution for the task which starts with embedding sentences within a bespoke feature space designed for the task.
We show that our method is able to identify core disinformation effectively.
arXiv Detail & Related papers (2020-10-21T08:53:36Z) - Research on Annotation Rules and Recognition Algorithm Based on Phrase Window [4.334276223622026]
We propose labeling rules based on phrase windows, and design corresponding phrase recognition algorithms.
The labeling rule uses phrases as the minimum unit, divides sentences into 7 nestable phrase types, and marks the grammatical dependencies between phrases.
The corresponding algorithm, drawing on the idea of identifying the target area in the image field, can find the start and end positions of various phrases in the sentence.
arXiv Detail & Related papers (2020-07-07T00:19:47Z)