Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation
- URL: http://arxiv.org/abs/2305.18893v1
- Date: Tue, 30 May 2023 09:49:42 GMT
- Title: Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation
- Authors: Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić
- Abstract summary: We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
- Score: 65.6736056006381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP pipelines split text into sentences as one of the crucial
preprocessing steps. Prior sentence segmentation tools either rely on
punctuation or require a considerable amount of sentence-segmented training
data: both central assumptions might fail when porting sentence segmenters to
diverse languages on a massive scale. In this work, we thus introduce a
multilingual punctuation-agnostic sentence segmentation method, currently
covering 85 languages, trained in a self-supervised fashion on unsegmented
text, by making use of newline characters which implicitly perform segmentation
into paragraphs. We further propose an approach that adapts our method to the
segmentation in a given corpus by using only a small number (64-256) of
sentence-segmented examples. The main results indicate that our method
outperforms all the prior best sentence-segmentation tools by an average of
6.1% F1 points. Furthermore, we demonstrate that proper sentence segmentation
has a point: the use of a (powerful) sentence segmenter makes a considerable
difference for a downstream application such as machine translation (MT). By
using our method to match sentence segmentation to the segmentation used during
training of MT models, we achieve an average improvement of 2.3 BLEU points
over the best prior segmentation tool, as well as massive gains over a trivial
segmenter that splits text into equally sized blocks.
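To make the newline-based self-supervision concrete, here is a minimal sketch of how such training data could be derived from raw text; the function below and the character-level boundary classifier it would feed are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: turn newline-delimited paragraphs into (characters, labels) pairs
# for a character-level boundary classifier. Hypothetical, for illustration.

def make_training_example(raw_text: str):
    paragraphs = [p.strip() for p in raw_text.split("\n") if p.strip()]
    chars, labels = [], []
    for paragraph in paragraphs:
        for i, ch in enumerate(paragraph):
            chars.append(ch)
            # The last character of a paragraph is an implicit boundary:
            # a newline followed it in the raw text.
            labels.append(1 if i == len(paragraph) - 1 else 0)
    return "".join(chars), labels

text = "One paragraph here.\nA second paragraph follows.\nAnd a third."
x, y = make_training_example(text)
assert len(x) == len(y) and sum(y) == 3
```

At inference time, a model trained on such pairs can split punctuation-free text wherever the predicted boundary probability crosses a threshold; the few-shot adaptation described above (64-256 segmented examples) could then, for instance, amount to re-tuning that threshold or a small head, though the exact mechanism is the paper's own.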
Related papers
- Scalable and Domain-General Abstractive Proposition Segmentation [20.532804009152255]
We focus on the task of abstractive proposition segmentation (APS): transforming text into simple, self-contained, well-formed sentences.
We first introduce evaluation metrics for the task to measure several dimensions of quality.
We then propose a scalable, yet accurate, proposition segmentation model.
arXiv Detail & Related papers (2024-06-28T10:24:31Z)
- Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation [9.703886326323644]
We introduce a new model - Segment any Text (SaT) - to solve this problem.
To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation.
To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains.
arXiv Detail & Related papers (2024-06-24T14:36:11Z)
- Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present a method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model; a toy dynamic-programming sketch in this spirit appears after this list.
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio inputs, like TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions on inputs tokenized by the standard segmentation and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- A Differentiable Relaxation of Graph Segmentation and Alignment for AMR Parsing [75.36126971685034]
We treat alignment and segmentation as latent variables in our model and induce them as part of end-to-end training.
Our method also approaches the performance of a model that relies on the segmentation rules of Lyu and Titov (2018), which were hand-crafted to handle individual AMR constructions.
arXiv Detail & Related papers (2020-10-23T21:22:50Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
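As a companion to the subword bigram entry above (and in the spirit of DPE's exact MAP inference), here is a toy Viterbi-style sketch of segmenting a word into subwords by dynamic programming; the unigram vocabulary and scores are invented for illustration, and a true bigram model would additionally carry the previous subword in the DP state.

```python
import math

# Toy dynamic-programming (Viterbi-style) subword segmentation under a
# unigram log-probability table. Vocabulary and scores are invented.

def segment(word: str, logp: dict, max_len: int = 8):
    n = len(word)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)              # back[i]: start index of last subword
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            sub = word[j:i]
            if sub in logp and best[j] + logp[sub] > best[i]:
                best[i] = best[j] + logp[sub]
                back[i] = j
    if best[n] == float("-inf"):
        return [word]  # no covering segmentation; fall back to whole word
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {"un": math.log(0.10), "segment": math.log(0.05),
         "able": math.log(0.08), "seg": math.log(0.02),
         "ment": math.log(0.02)}
print(segment("unsegmentable", vocab))  # ['un', 'segment', 'able']
```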