Detect Language of Transliterated Texts
- URL: http://arxiv.org/abs/2004.13521v1
- Date: Sun, 26 Apr 2020 10:28:02 GMT
- Title: Detect Language of Transliterated Texts
- Authors: Sourav Sen
- Abstract summary: Informal transliteration from other languages to English is prevalent in social media threads, instant messaging, and discussion forums.
We propose a Language Identification (LID) system, with an approach for feature extraction.
We tokenize the words into phonetic syllables and use a simple Long Short-term Memory (LSTM) network architecture to detect the language of transliterated texts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Informal transliteration from other languages to English is prevalent in
social media threads, instant messaging, and discussion forums. Without
identifying the language of such transliterated text, users who do not speak
that language cannot understand its content using translation tools. We propose
a Language Identification (LID) system, with an approach for feature
extraction, which can detect the language of transliterated texts reasonably
well even with limited training data and computational resources. We tokenize
the words into phonetic syllables and use a simple Long Short-term Memory
(LSTM) network architecture to detect the language of transliterated texts.
With intensive experiments, we show that the tokenization of transliterated
words as phonetic syllables effectively represents their causal sound patterns.
Phonetic syllable tokenization, therefore, makes it easier for even simpler
model architectures to learn the characteristic patterns to identify any
language.
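As a rough illustration of the pipeline the abstract describes, the sketch below splits romanized words into syllable-like tokens and maps them to integer ids that a sequence model such as an LSTM could consume. The abstract does not state the paper's actual syllabification rules, so the consonant-then-vowel rule, the function names (`syllabify`, `build_vocab`, `encode`), and the sample words are all assumptions made for illustration.

```python
import re

# Assumed rule (not taken from the paper): a token is an optional consonant
# run followed by a vowel run; consonants left at the end of a word form a
# final token of their own.
_SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+|[^aeiou]+$")


def syllabify(word):
    """Split one romanized word into syllable-like tokens."""
    cleaned = re.sub(r"[^a-z]", "", word.lower())  # keep ASCII letters only
    return _SYLLABLE.findall(cleaned)


def build_vocab(texts):
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids for padding and unknowns
    for text in texts:
        for word in text.split():
            for token in syllabify(word):
                vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a text into the id sequence a sequence classifier would read."""
    return [
        vocab.get(token, vocab["<unk>"])
        for word in text.split()
        for token in syllabify(word)
    ]


if __name__ == "__main__":
    vocab = build_vocab(["namaste dost"])
    print(syllabify("dhanyavaad"))
    print(encode("namaste dosti", vocab))
```

The resulting id sequences would then be padded to a common length and passed through an embedding layer and an LSTM classifier; that part is omitted here because the abstract gives no layer sizes or training details.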
Related papers
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages.
Since low-resource languages do not have enough video-text paired data to train the model, it is regarded as challenging to develop lip reading models for low-resource languages.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech [1.9688095374610102]
We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
This mapping was tested for whether it could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
arXiv Detail & Related papers (2022-04-14T21:04:55Z)
- Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features [30.37026279162593]
In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages.
This enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.
arXiv Detail & Related papers (2022-03-07T07:58:01Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.