Byte Pair Encoding Is All You Need For Automatic Bengali Speech
Recognition
- URL: http://arxiv.org/abs/2401.15532v1
- Date: Sun, 28 Jan 2024 00:41:21 GMT
- Title: Byte Pair Encoding Is All You Need For Automatic Bengali Speech
Recognition
- Authors: Ahnaf Mozib Samin
- Abstract summary: Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge.
Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language.
Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Byte pair encoding (BPE) emerges as an effective tokenization method for
tackling the out-of-vocabulary (OOV) challenge in various natural language and
speech processing tasks. Recent research highlights the dependency of BPE
subword tokenization's efficacy on the morphological nature of the language,
particularly in languages rich in inflectional morphology, where fewer BPE
merges suffice for generating highly productive tokens. Motivated by this, our
study empirically identifies the optimal number of BPE tokens for Bengali, a
language known for its morphological complexity, thus enhancing
out-of-distribution automatic speech recognition (ASR) performance.
Experimental evaluation reveals that an excessively high number of BPE tokens
can lead to overfitting, while approximately 500-1000 tokens result in superior
OOV performance. Furthermore, we conduct a comparative analysis of BPE with
character-based and unigram-based tokenization methods. By introducing BPE
tokenization to Bengali ASR, we achieve a substantial reduction in the word
error rate (WER) from 66.44% in our character-based baseline system to 63.80%
on the LB-ASRTD eval set and from 46.34% to 42.80% on the SHRUTI eval set, both
of which include out-of-distribution data.
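As a concrete illustration of the recipe above, the sketch below trains a BPE vocabulary in the 500-1000 token range the abstract reports as optimal. It assumes SentencePiece as the BPE trainer (the paper does not specify a toolkit), and the transcript file name is hypothetical.

```python
# A minimal sketch of the tokenization step, assuming SentencePiece as the BPE
# trainer; "bengali_transcripts.txt" is a hypothetical file, one transcript per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bengali_transcripts.txt",
    model_prefix="bn_bpe",
    model_type="bpe",
    vocab_size=1000,          # within the 500-1000 range the paper finds optimal
    character_coverage=1.0,   # retain the full Bengali character inventory
)

sp = spm.SentencePieceProcessor(model_file="bn_bpe.model")
print(sp.encode("আমি বাংলায় গান গাই", out_type=str))  # subword targets for the ASR model
```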
Related papers
- Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism via parameter-free, computation-light, and easy-to-implement modifications to the original BPE.
On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
- Tokenization Is More Than Compression
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens.
We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
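The minimum-token objective is easy to state as a dynamic program. The toy sketch below illustrates that objective only; it is not the paper's PathPiece implementation, and the tiny vocabulary is invented for the example.

```python
# A toy dynamic program for the minimum-token objective described above; an
# illustration of the objective, not the paper's PathPiece tokenizer.
def min_token_segmentation(text, vocab, max_len=16):
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)             # back[i] = start index of the last token
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == float("inf"):
        return None  # unreachable if vocab contains every single character
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical vocabulary: several segmentations exist; the DP picks the shortest.
vocab = {"t", "o", "k", "e", "n", "to", "ken", "token"}
print(min_token_segmentation("token", vocab))  # -> ['token']
```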
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- On the N-gram Approximation of Pre-trained Language Models
Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks.
This study investigates the potential usage of PLMs for language modelling in Automatic Speech Recognition (ASR).
We compare the application of large-scale text sampling and probability conversion for approximating GPT-2 into an n-gram model.
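A rough sketch of the sampling-and-counting approach just described, assuming GPT-2 via the Hugging Face transformers pipeline; the prompt, sample sizes, whitespace tokenization, and unsmoothed bigram estimate are illustrative simplifications, not the paper's setup.

```python
# Sketch of "sample text from the PLM, then fit an n-gram model" under the
# assumptions stated above.
from collections import Counter

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator("The", do_sample=True, max_length=64, num_return_sequences=8)
corpus = " ".join(o["generated_text"] for o in outputs)

words = corpus.split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
# Maximum-likelihood bigram probabilities over the sampled text (no smoothing).
bigram_prob = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
```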
arXiv Detail & Related papers (2023-06-12T06:42:08Z)
- Bilingual End-to-End ASR with Byte-Level Subwords
We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE).
We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances.
We find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters.
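A small sketch of why byte-level units help in the bilingual setting: any UTF-8 string decomposes into at most 256 byte symbols, so the output inventory stays fixed across languages. The example string is invented for illustration.

```python
# Character-level vs byte-level units for a code-switched utterance.
text = "set an alarm 設定鬧鐘"
char_units = list(text)                                      # inventory grows with each new script
byte_units = [f"<0x{b:02X}>" for b in text.encode("utf-8")]  # fixed 256-symbol inventory
print(len(char_units), len(byte_units))  # byte sequences are longer but vocabulary-stable
```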
arXiv Detail & Related papers (2022-05-01T15:01:01Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to the low-resource, real-world challenge of de-identifying code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
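A minimal sketch of BPE-dropout-style unit augmentation, assuming a previously trained SentencePiece BPE model (SentencePiece exposes BPE-dropout through sampled encoding); the model file and input text are hypothetical, and this is not the paper's exact augmentation scheme.

```python
# BPE-dropout as dynamic unit augmentation, under the assumptions stated above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical BPE model

text = "merhaba dünya"  # illustrative Turkish input
print(sp.encode(text, out_type=str))  # deterministic BPE segmentation
# With a BPE model, enable_sampling randomly drops merges (alpha = drop
# probability), producing a different segmentation on each call.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1))
```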
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
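A minimal sketch of the unigram-versus-BPE comparison above, assuming SentencePiece (which implements both model types); the corpus path, vocabulary size, and sample sentence are placeholders.

```python
# Train a BPE and a unigram LM tokenizer on the same corpus and compare
# segmentations, under the assumptions stated above.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        model_type=model_type,
        vocab_size=1000,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
sample = "tokenization choices shape downstream performance"
print("BPE:    ", bpe.encode(sample, out_type=str))
print("Unigram:", uni.encode(sample, out_type=str))
```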