Syllable Subword Tokens for Open Vocabulary Speech Recognition in
Malayalam
- URL: http://arxiv.org/abs/2301.06736v1
- Date: Tue, 17 Jan 2023 07:29:47 GMT
- Title: Syllable Subword Tokens for Open Vocabulary Speech Recognition in
Malayalam
- Authors: Kavya Manohar, A. R. Jayan, Rajeev Rajan
- Abstract summary: A pronunciation lexicon (PL) and a language model (LM) are essential to correctly retrieve spoken word sequences.
Because Malayalam is a morphologically complex language, its vocabulary is so large that it is impossible to build a PL and an LM that cover all of its diverse word forms.
Using subword tokens to build the PL and LM, and combining them into words after decoding, enables the recovery of many out-of-vocabulary words.
- Score: 2.7823528791601695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a hybrid automatic speech recognition (ASR) system, a pronunciation
lexicon (PL) and a language model (LM) are essential to correctly retrieve
spoken word sequences. Because Malayalam is a morphologically complex language,
its vocabulary is so large that it is impossible to build a PL and an LM that
cover all of its diverse word forms. Using subword tokens to build the PL and
LM, and combining them into words after decoding, enables the recovery of many
out-of-vocabulary words. In this work we investigate the impact of using syllables
as subword tokens instead of words in Malayalam ASR, and evaluate the relative
improvement in lexicon size, model memory requirement and word error rate.
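To make the subword round trip concrete, here is a minimal sketch of the decode-and-recombine idea, assuming a "+" continuation marker on non-final syllables and a naive vowel-based `syllabify` placeholder. Real Malayalam syllabification follows the orthographic rules of the script, and the marker convention is only one common choice, not necessarily the paper's.

```python
# Minimal sketch: syllable subword tokens with a "+" continuation marker.
# syllabify() is a naive placeholder, not Malayalam-aware syllabification.
import re

def syllabify(word: str) -> list[str]:
    # Cut after each vowel group (placeholder rule for illustration only).
    return re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", word)

def to_subword_tokens(words: list[str]) -> list[str]:
    # Build the token sequence the PL and LM would be trained on.
    tokens = []
    for w in words:
        syls = syllabify(w)
        tokens += [s + "+" for s in syls[:-1]] + [syls[-1]]
    return tokens

def to_words(tokens: list[str]) -> list[str]:
    # Applied after decoding: merge "+"-marked tokens back into words.
    words, buf = [], ""
    for t in tokens:
        if t.endswith("+"):
            buf += t[:-1]
        else:
            words.append(buf + t)
            buf = ""
    return words

assert to_words(to_subword_tokens(["open", "vocabulary"])) == ["open", "vocabulary"]
```

Keeping the continuation marker on the token itself means the PL and LM vocabularies stay closed, while recovering words after decoding is a deterministic merge.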
Related papers
- Subword models struggle with word learning, but surprisal hides it [8.883534683127415]
We study word learning in subword and character language models with the psycholinguistic lexical decision task.
While subword LMs struggle to discern words and non-words with high accuracy, character LMs solve this task easily and consistently.
arXiv Detail & Related papers (2025-02-18T13:09:16Z)
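As an illustration of the lexical decision task used above, a hedged sketch: score each string by the total surprisal of its subword tokens under a causal LM, with GPT-2 standing in for the models actually studied.

```python
# Sketch only: GPT-2 stands in for the subword LMs the paper studies.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisal(word: str, context: str = "They said the word") -> float:
    """Total surprisal (nats) of the word's subword tokens given a context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + " " + word, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return -sum(logp[0, i - 1, ids[0, i]].item()
                for i in range(ctx_len, ids.shape[1]))

# Lexical decision: real words should be less surprising than non-words.
print(word_surprisal("table"), word_surprisal("tabel"))
```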
- From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization allows words to be split into characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language.
arXiv Detail & Related papers (2024-03-26T17:26:50Z)
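Since Swahili syllables are predominantly open (consonants followed by a single vowel), a first-approximation syllable tokenizer fits in one rule. The sketch below is an illustration of the idea only; it ignores syllabic nasals (as in "mtu") and is not the paper's tokenizer.

```python
# Naive open-syllable splitter for illustration; syllabic nasals and
# loanword consonant clusters would need extra rules in a real tokenizer.
import re

def syllable_tokenize(word: str) -> list[str]:
    return re.findall(r"[^aeiou]*[aeiou]|[^aeiou]+$", word.lower())

print(syllable_tokenize("kiswahili"))  # ['ki', 'swa', 'hi', 'li']
```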
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
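One way to picture such a word/phrase memory is as shallow-fusion-style biasing at rescoring time. The toy sketch below boosts hypotheses containing memory phrases; the bonus weight and substring matching rule are illustrative assumptions, not the paper's architecture.

```python
# Toy rescoring pass: hypotheses matching memory phrases get a score bonus.
# The bonus weight and substring matching rule are illustrative assumptions.
def rescore(hypotheses, memory, bonus=2.0):
    """hypotheses: list of (text, log_score); memory: iterable of phrases."""
    rescored = [(text, score + bonus * sum(p in text for p in memory))
                for text, score in hypotheses]
    return sorted(rescored, key=lambda pair: -pair[1])

memory = {"covid booster"}
hyps = [("book a covid booster", -4.2), ("book a covert boost er", -3.9)]
print(rescore(hyps, memory)[0][0])  # "book a covid booster"
```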
- Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
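The merging step can be illustrated with a pointwise-mutual-information criterion: bigrams whose PMI clears a threshold are joined into single tokens before fitting LDA. The threshold, minimum count, and underscore join here are assumptions for this sketch, not the paper's exact recipe.

```python
# Merge high-PMI bigrams into single tokens before fitting an LDA model.
import math
from collections import Counter

def merge_collocations(docs, min_count=2, threshold=1.0):
    """docs: list of token lists; returns docs with collocations merged."""
    uni, bi, total = Counter(), Counter(), 0
    for doc in docs:
        uni.update(doc)
        bi.update(zip(doc, doc[1:]))
        total += len(doc)
    def pmi(a, b):  # pointwise mutual information, illustrative normalizer
        return math.log(bi[(a, b)] * total / (uni[a] * uni[b]))
    keep = {p for p, c in bi.items() if c >= min_count and pmi(*p) > threshold}
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in keep:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged

docs = [["latent", "dirichlet", "allocation", "is", "popular"],
        ["latent", "dirichlet", "allocation", "finds", "topics"]]
print(merge_collocations(docs)[0])  # ['latent_dirichlet', 'allocation', ...]
```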
- SubICap: Towards Subword-informed Image Captioning [37.42085521950802]
We decompose words into smaller constituent units, 'subwords', and represent captions as a sequence of subwords instead of words.
Our captioning system improves various metric scores, with a training vocabulary size approximately 90% less than the baseline.
arXiv Detail & Related papers (2020-12-24T06:10:36Z)
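That kind of vocabulary reduction can be reproduced in spirit with an off-the-shelf subword model. This sketch uses SentencePiece on an assumed captions.txt file with an illustrative vocabulary size; it is not SubICap's actual configuration.

```python
# Train a small subword model on caption text, then encode captions as
# subword sequences. File name and vocab_size are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="captions.txt",      # assumed: one caption per line
    model_prefix="caption_sp",
    vocab_size=1000,           # far smaller than a typical word vocabulary
)
sp = spm.SentencePieceProcessor(model_file="caption_sp.model")
print(sp.encode("a snowboarder jumping over a ramp", out_type=str))
```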
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system.
One popular approach to covering OOVs is to use subword units rather than words.
In this paper we explore different existing methods of this solution at both the graph-construction and search-method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
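The arithmetic behind subword-based OOV recovery is easy to demonstrate on a toy corpus: unseen inflections are OOV at the word level but remain composable from subword pieces observed in training. A minimal sketch, with invented toy data:

```python
# Toy illustration: word-level OOV rate vs. subword composability.
def oov_rate(train_tokens, test_tokens):
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "walk walked talk talking".split()
test = "walking talked".split()
print(oov_rate(train, test))  # 1.0: both inflections are word-level OOVs

# With subword units (here, naive stem + suffix pieces) every test word is
# composable from pieces observed in training, so no token is truly OOV.
subword_vocab = {"walk", "talk", "ed", "ing"}
print(all(any(w.startswith(s) and w[len(s):] in subword_vocab | {""}
              for s in subword_vocab) for w in test))  # True
```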