On Target Segmentation for Direct Speech Translation
- URL: http://arxiv.org/abs/2009.04707v1
- Date: Thu, 10 Sep 2020 07:47:01 GMT
- Title: On Target Segmentation for Direct Speech Translation
- Authors: Mattia Antonino Di Gangi and Marco Gaido and Matteo Negri and Marco Turchi
- Abstract summary: Subword-level segmentation became the state of the art in neural machine translation.
We compare the two methods on three benchmarks covering 8 language directions and multilingual training.
Subword-level segmentation compares favorably in all settings, outperforming its character-level counterpart by 1 to 3 BLEU points.
- Score: 20.456325305495966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on direct speech translation show continuous improvements by
means of data augmentation techniques and bigger deep learning models. While
these methods are helping to close the gap between this new approach and the
more traditional cascaded one, there are many incongruities among different
studies that make it difficult to assess the state of the art. Surprisingly,
one point of discussion is the segmentation of the target text. Character-level
segmentation was initially proposed to obtain an open vocabulary, but it
results in long sequences and long training times. Subword-level segmentation
then became the state of the art in neural machine translation, as it produces
shorter sequences that reduce training time while remaining superior to
word-level models. Accordingly, recent works on speech translation started
using target subwords despite the initial use of characters and some recent
claims of better results at the character level. In this work, we perform an
extensive comparison of the two methods on three benchmarks covering 8 language
directions and multilingual training. Subword-level segmentation compares
favorably in all settings, outperforming its character-level counterpart by
1 to 3 BLEU points.
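To make the contrast concrete, the snippet below illustrates the two target segmentations, using the SentencePiece library as one possible BPE implementation. This is a minimal sketch: the corpus file name, vocabulary size, and example sentence are illustrative assumptions, not the paper's actual experimental settings.

```python
# Minimal sketch: character- vs. subword-level target segmentation.
# Assumes the `sentencepiece` package; file names, vocab_size, and the
# example sentence are illustrative, not the paper's settings.
import sentencepiece as spm

# Train a small BPE model on the target side of the training corpus
# (plain text, one sentence per line).
spm.SentencePieceTrainer.train(
    input="target_corpus.txt",   # hypothetical target-side corpus
    model_prefix="target_bpe",
    vocab_size=8000,             # assumed vocabulary size
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="target_bpe.model")

sentence = "Subword segmentation produces shorter target sequences."

# Character-level segmentation: open vocabulary, but one token per
# character, so the decoder must emit much longer sequences.
char_tokens = list(sentence.replace(" ", "▁"))  # "▁" marks word boundaries

# Subword-level segmentation: far fewer tokens per sentence.
subword_tokens = sp.encode(sentence, out_type=str)

print(f"characters: {len(char_tokens)} tokens")
print(f"subwords:   {len(subword_tokens)} tokens")
```

On a typical sentence, the character-level sequence is several times longer than the subword-level one, which is exactly the sequence-length and training-time gap the abstract describes.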
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts). This makes them suitable for addressing text classification problems in domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Using Contextual Information for Sentence-level Morpheme Segmentation [0.0]
We redefine morpheme segmentation as a sequence-to-sequence problem, treating the entire sentence as input rather than isolating individual words.
Our findings reveal that the multilingual model consistently exhibits superior performance compared to monolingual counterparts.
arXiv Detail & Related papers (2024-03-15T20:12:32Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Neural Sequence Segmentation as Determining the Leftmost Segments [25.378188980430256]
We propose a novel framework that incrementally segments natural language sentences at segment level.
For every step in segmentation, it recognizes the leftmost segment of the remaining sequence.
We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech tagging across 3 datasets.
arXiv Detail & Related papers (2021-04-15T03:35:03Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between the predictions made from inputs tokenized by the standard segmentation and by probabilistic segmentations (a toy sketch of this kind of objective appears after this list).
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
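Returning to the Multi-view Subword Regularization entry above, a toy sketch of a consistency objective of that kind is shown below. The symmetric-KL form, the `alpha` weight, and the PyTorch phrasing are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_det, logits_samp, labels, alpha=0.5):
    """Toy MVR-style objective: supervised loss on both tokenization
    views plus a symmetric KL term pushing their predictive
    distributions together. `alpha` is an illustrative weight."""
    ce = 0.5 * (F.cross_entropy(logits_det, labels)
                + F.cross_entropy(logits_samp, labels))
    log_p = F.log_softmax(logits_det, dim=-1)
    log_q = F.log_softmax(logits_samp, dim=-1)
    kl = 0.5 * (F.kl_div(log_q, log_p.exp(), reduction="batchmean")
                + F.kl_div(log_p, log_q.exp(), reduction="batchmean"))
    return ce + alpha * kl

# Dummy usage: logits from the same classifier applied to an input
# tokenized deterministically and with a sampled subword segmentation.
logits_det = torch.randn(4, 3)   # batch of 4, 3 classes
logits_samp = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(consistency_loss(logits_det, logits_samp, labels))
```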
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.