Segmentation en phrases : ouvrez les guillemets sans perdre le fil
- URL: http://arxiv.org/abs/2407.19808v1
- Date: Mon, 29 Jul 2024 09:02:38 GMT
- Title: Segmentation en phrases : ouvrez les guillemets sans perdre le fil (Sentence segmentation: open the quotation marks without losing the thread)
- Authors: Sandrine Ollinger, Denis Maurel
- Abstract summary: This paper presents a graph cascade for sentence segmentation of XML documents.
Our proposal allows sentences nested inside sentences for cases introduced by quotation marks and hyphens, and also pays particular attention to interpolated clauses (incises) introduced by parentheses and to lists introduced by colons.
- Score: 0.08192907805418582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a graph cascade for sentence segmentation of XML documents. Our proposal allows sentences nested inside sentences for cases introduced by quotation marks and hyphens, and also pays particular attention to interpolated clauses (incises) introduced by parentheses and to lists introduced by colons. We present how the tool works and compare the results obtained with those available in 2019 on the same dataset, together with an evaluation of the system's performance on a test corpus.
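The idea of "sentences inside sentences" can be illustrated with a toy sketch. The code below is not the paper's graph cascade: it is a regex-based approximation written for this listing, and the names `Sentence`, `segment`, `QUOTE`, `END`, and `PLACEHOLDER` are hypothetical. It assumes French guillemets (« ») that are not nested inside one another, plain text rather than XML, and sentence ends marked by `.`, `!`, `?`, or `…` followed by whitespace.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    # Sentences found inside quoted spans of this sentence.
    children: list["Sentence"] = field(default_factory=list)


QUOTE = re.compile(r"«\s*(.*?)\s*»", re.S)   # one non-nested quoted span
END = re.compile(r"(?<=[.!?…])\s+")          # whitespace after end punctuation
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")   # marker standing in for a quote


def segment(text: str) -> list[Sentence]:
    """Split text into sentences, keeping quoted sentences as children."""
    quotes: list[str] = []

    def mask(match: re.Match) -> str:
        # Hide the quoted span so its punctuation cannot split the outer sentence.
        quotes.append(match.group(1))
        return f"\x00{len(quotes) - 1}\x00"

    masked = QUOTE.sub(mask, text)
    result: list[Sentence] = []
    for part in END.split(masked.strip()):
        if not part:
            continue
        ids = [int(i) for i in PLACEHOLDER.findall(part)]
        # Segment each quoted span on its own and attach the result.
        children = [child for i in ids for child in segment(quotes[i])]
        restored = PLACEHOLDER.sub(
            lambda m: f"« {quotes[int(m.group(1))]} »", part
        )
        result.append(Sentence(restored, children))
    return result


if __name__ == "__main__":
    sample = (
        "Elle a répondu : « Non. Je reste ici. » et elle s'est assise. "
        "Il est parti."
    )
    for outer in segment(sample):
        print(outer.text)
        for inner in outer.children:
            print("    ", inner.text)
```

Masking the quoted span before splitting is the design choice that keeps the outer sentence whole: the full stops inside the quotation only produce boundaries at the inner level. Hyphen-introduced dialogue, parenthetical clauses, and colon-introduced lists, which the abstract also mentions, are not handled here.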
Related papers
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- SentAlign: Accurate and Scalable Sentence Alignment [4.363828136730248]
SentAlign is an accurate sentence alignment tool designed to handle very large parallel document pairs.
The alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences.
arXiv Detail & Related papers (2023-11-15T14:15:41Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
- Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences [69.3939291118954]
This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces.
Results show that one method generates the most clusterable embeddings.
In general, the embeddings of span sub-sentences have better clustering properties than the original sentences.
arXiv Detail & Related papers (2021-10-02T00:47:35Z)
- On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles [10.28696219236292]
We formulate citation-worthiness prediction as a sequence labeling task solved using a hierarchical BiLSTM model.
We contribute a new benchmark dataset containing over two million sentences and their corresponding labels.
Our results quantify the benefits of using context and contextual embeddings for citation worthiness.
arXiv Detail & Related papers (2021-04-18T21:47:30Z)
- Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts [0.533024001730262]
We first describe the manual annotation of sentence boundaries of an Estonian web dataset.
We then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus.
arXiv Detail & Related papers (2020-11-16T11:13:41Z)
- An Unsupervised Semantic Sentence Ranking Scheme for Text Documents [9.272728720669846]
Semantic SentenceRank (SSR) is an unsupervised scheme for ranking sentences in a single document according to their relative importance.
It extracts essential words and phrases from a text document, and uses semantic measures to construct, respectively, a semantic phrase graph over phrases and words, and a semantic sentence graph over sentences.
arXiv Detail & Related papers (2020-04-28T20:17:51Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- Automatic Discourse Segmentation: an evaluation in French [65.00134288222509]
We describe several discourse segmentation methods as well as a preliminary evaluation of the segmentation quality.
We have developed three models based solely on resources available in several languages simultaneously: marker lists and statistical POS tagging.
arXiv Detail & Related papers (2020-02-10T21:35:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.