Byte Pair Encoding Is All You Need For Automatic Bengali Speech
Recognition
- URL: http://arxiv.org/abs/2401.15532v1
- Date: Sun, 28 Jan 2024 00:41:21 GMT
- Title: Byte Pair Encoding Is All You Need For Automatic Bengali Speech
Recognition
- Authors: Ahnaf Mozib Samin
- Abstract summary: Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge.
Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language.
Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Byte pair encoding (BPE) emerges as an effective tokenization method for
tackling the out-of-vocabulary (OOV) challenge in various natural language and
speech processing tasks. Recent research highlights the dependency of BPE
subword tokenization's efficacy on the morphological nature of the language,
particularly in languages rich in inflectional morphology, where fewer BPE
merges suffice for generating highly productive tokens. Motivated by this, our
study empirically identifies the optimal number of BPE tokens for Bengali, a
language known for its morphological complexity, thus enhancing
out-of-distribution automatic speech recognition (ASR) performance.
Experimental evaluation reveals that an excessively high number of BPE tokens
can lead to overfitting, while approximately 500-1000 tokens result in superior
OOV performance. Furthermore, we conduct a comparative analysis of BPE with
character-based and unigram-based tokenization methods. By introducing BPE
tokenization to Bengali ASR, we achieve a substantial reduction in the word
error rate (WER) from 66.44% in our character-based baseline system to 63.80%
on the LB-ASRTD eval set and from 46.34% to 42.80% on the SHRUTI eval set, both
of which include out-of-distribution data.
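The paper's training pipeline is not reproduced here, but as a rough sketch of the tokenization comparison the abstract describes, the snippet below trains BPE and unigram SentencePiece models on a Bengali transcript corpus, with BPE vocabularies in the 500-1000 range the study identifies as favorable. The corpus file name and exact vocabulary sizes are illustrative assumptions, not values taken from the paper.
```python
# Illustrative sketch only: train subword tokenizers that mirror the
# comparison in the abstract (character / BPE / unigram). The corpus file
# and vocabulary sizes are assumptions, not the paper's actual settings.
import sentencepiece as spm

CORPUS = "bengali_transcripts.txt"  # one transcript per line (assumed file)

# A character-level baseline corresponds to model_type="char"; its vocabulary
# must cover every distinct character in the corpus, so it is omitted here.
for model_type, vocab_size in [("bpe", 500), ("bpe", 1000), ("unigram", 1000)]:
    spm.SentencePieceTrainer.train(
        input=CORPUS,
        model_prefix=f"bn_{model_type}_{vocab_size}",
        model_type=model_type,
        vocab_size=vocab_size,
        character_coverage=1.0,  # keep the full Bengali script
    )

# Inspect how a 500-token BPE model segments a held-out sentence.
sp = spm.SentencePieceProcessor(model_file="bn_bpe_500.model")
print(sp.encode("আমি বাংলায় গান গাই", out_type=str))
```
For reference, the reported WER changes correspond to roughly a 4% relative reduction on LB-ASRTD ((66.44 - 63.80) / 66.44) and about 7.6% on SHRUTI ((46.34 - 42.80) / 46.34).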
Related papers
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs).
We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization.
We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- SuperBPE: Space Travel for Language Models [112.64910939119056]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm.
SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average.
Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, i.e., those not included in its pre-training.
We propose methods that exploit the relationships between unseen and pre-trained languages to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
- When Every Token Counts: Optimal Segmentation for Low-Resource Language Models [0.0]
We show that an optimal Byte-Pair Encoding (BPE) configuration significantly reduces token count compared to greedy segmentation.
Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications.
arXiv Detail & Related papers (2024-12-09T19:11:54Z)
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
We propose a retrieval-enhanced sign language generation (SLG) approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
- LBPE: Long-token-first Tokenization to Improve Large Language Models [26.3619552256488]
Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens.
We propose LBPE, which prioritizes long tokens during the encoding process.
Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-11-08T12:03:36Z)
- Exploring SSL Discrete Tokens for Multilingual ASR [29.059205486691745]
This study presents a comparison of discrete tokens generated by various leading SSL models across multiple language domains.
Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on Fbank features in ASR tasks.
arXiv Detail & Related papers (2024-09-13T13:13:39Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
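BatchBPE's own code is not shown here; as a minimal sketch of the byte-pair merge loop that any such trainer performs, the snippet below repeatedly replaces the most frequent adjacent symbol pair in a word-frequency table with a new merged token.
```python
# Minimal sketch of byte-pair merge training (not the BatchBPE codebase):
# repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping a word (tuple of symbols) to its corpus count."""
    merges, vocab = [], dict(word_freqs)
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

# Toy usage: word frequencies expressed as tuples of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
print(train_bpe(corpus, num_merges=10))
```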
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold-token removal mechanism through parameter-free, computation-light, and easy-to-implement modifications to the original BPE method.
On extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
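The released PathPiece code is not reproduced here, but the minimum-token objective it describes can be illustrated with a short dynamic-programming segmenter over a fixed vocabulary; the toy vocabulary below is an assumption for demonstration only.
```python
# Sketch of minimum-token segmentation over a fixed vocabulary via dynamic
# programming (illustrates the objective described above; not the PathPiece
# implementation itself).
def min_token_segmentation(text, vocab, max_piece_len=16):
    INF = float("inf")
    n = len(text)
    best = [0] + [INF] * n    # best[i] = fewest tokens covering text[:i]
    back = [None] * (n + 1)   # back[i] = start index of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i], back[i] = best[j] + 1, j
    if best[n] == INF:
        return None  # text cannot be covered by this vocabulary
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary: all single characters plus a few multi-character pieces.
vocab = set("unrelated") | {"un", "re", "lat", "ed", "related"}
print(min_token_segmentation("unrelated", vocab))  # -> ['un', 'related']
```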
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Bilingual End-to-End ASR with Byte-Level Subwords [4.268218327369146]
We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE).
We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances.
We find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters.
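The paper's bilingual ASR model is not shown here; purely as a sketch of how a byte-level BPE (BBPE) vocabulary can be shared across two scripts without OOV symbols, the snippet below trains one with the Hugging Face `tokenizers` library. The corpus files, language pair, and vocabulary size are assumptions, and no penalty scheme is implemented.
```python
# Sketch of a shared byte-level BPE (BBPE) vocabulary for two languages,
# using the Hugging Face `tokenizers` library. File names, language pair,
# and vocabulary size are illustrative; this is not the paper's ASR system.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["lang_a_transcripts.txt", "lang_b_transcripts.txt"],  # assumed files
    vocab_size=3000,
    min_frequency=2,
)

# Every character decomposes into raw bytes, so neither language can produce
# an out-of-vocabulary symbol at the output layer.
print(tokenizer.encode("speech recognition 语音识别").tokens)
```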
arXiv Detail & Related papers (2022-05-01T15:01:01Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
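The paper's augmentation pipeline is not reproduced here; the snippet below only sketches the underlying BPE-dropout idea, sampling different subword segmentations of the same sentence with SentencePiece. The model file and dropout rate are assumptions.
```python
# Sketch of BPE-dropout-style sampling with SentencePiece: the same sentence
# receives different segmentations across training passes, which is the kind
# of subword-unit augmentation the entry above builds on. The model path and
# alpha value are assumptions, not the paper's settings.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="low_resource_bpe.model")  # assumed BPE model

sentence = "end to end speech recognition"
for _ in range(3):
    # With a BPE model, enable_sampling applies merge dropout with probability alpha.
    print(sp.encode(sentence, out_type=str, enable_sampling=True, alpha=0.1))
```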
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)