Language Lexicons for Hindi-English Multilingual Text Processing
- URL: http://arxiv.org/abs/2106.15105v1
- Date: Tue, 29 Jun 2021 05:42:54 GMT
- Title: Language Lexicons for Hindi-English Multilingual Text Processing
- Authors: Mohd Zeeshan Ansari, Tanvir Ahmad and Noaima Bari
- Abstract summary: The present Language Identification techniques presume that a document contains text in one of a fixed set of languages.
Due to the unavailability of large standard corpora for Hindi-English mixed-lingual language processing tasks, we propose language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language Identification in textual documents is the process of automatically
detecting the language contained in a document based on its content. Present
Language Identification techniques presume that a document contains text in one
of a fixed set of languages; however, this presumption does not hold for
multilingual documents, which include content in more than one language. Due to
the unavailability of large standard corpora for Hindi-English mixed-lingual
language processing tasks, we propose language lexicons, a novel kind of lexical
database that supports several multilingual language processing tasks. These
lexicons are built by learning classifiers over transliterated Hindi and English
vocabulary. The designed lexicons possess richer quantitative characteristics
than their primary source of collection, as revealed using visualization
techniques.
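The abstract's core idea, learning classifiers over transliterated Hindi and English vocabulary to decide a word's language, can be sketched as follows. This is a minimal illustration only: the word lists, features (character bigrams), and classifier (add-one-smoothed Naive Bayes) are assumptions for demonstration, not the paper's actual setup.

```python
from collections import Counter
import math

def char_ngrams(word, n=2):
    """Character bigrams with boundary markers, e.g. 'din' -> ['^d','di','in','n$']."""
    w = f"^{word}$"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def train(lexicon):
    """lexicon: dict mapping label -> list of words. Returns per-label n-gram counts."""
    model = {label: Counter() for label in lexicon}
    for label, words in lexicon.items():
        for word in words:
            model[label].update(char_ngrams(word))
    return model

def classify(model, word):
    """Score the word under each label with add-one-smoothed log-probabilities."""
    best_label, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        vocab = len(counts) + 1
        score = sum(math.log((counts[g] + 1) / (total + vocab))
                    for g in char_ngrams(word))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy transliterated-Hindi vs. English word lists (illustrative only).
lexicon = {
    "hi": ["dhanyavad", "namaste", "pyaar", "dost", "khushi", "zindagi"],
    "en": ["thanks", "hello", "love", "friend", "happy", "life"],
}
model = train(lexicon)
print(classify(model, "pyaara"))  # likely "hi"
```

A word-level classifier of this shape, applied over a large transliterated vocabulary, yields exactly the kind of scored lexicon the abstract describes; a real system would use richer features and far more data.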
Related papers
- Fine-Tuned Self-Supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech [4.39549503760707]
We develop a continuous multilingual language diarizer using fine-tuned speech representations extracted from a large self-supervised architecture (WavLM)
We experiment with a code-switched corpus consisting of five South African languages (isiZulu, isiXhosa, Setswana, Sesotho and English)
arXiv Detail & Related papers (2023-12-15T09:40:41Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation [0.40611352512781856]
Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language.
We propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings.
Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text.
arXiv Detail & Related papers (2021-05-07T17:48:53Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
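The KL-divergence self-teaching loss mentioned in this summary can be illustrated with a small sketch: an average KL between fixed soft pseudo-labels (teacher) and model predictions (student) over a batch. The distributions and function names below are hypothetical, for illustration only, and do not reproduce FILTER's actual implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_teaching_loss(teacher_probs, student_probs):
    """Average KL between soft pseudo-labels (teacher, treated as fixed) and
    student predictions over a batch of examples."""
    batch = len(teacher_probs)
    return sum(kl_divergence(t, s)
               for t, s in zip(teacher_probs, student_probs)) / batch

# Hypothetical 3-class soft pseudo-labels for translated target-language text.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(round(self_teaching_loss(teacher, student), 4))
```

In training, this term would be added to the supervised loss so the student's predictions on translated text are pulled toward the auto-generated soft labels.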
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Discovering Bilingual Lexicons in Polyglot Word Embeddings [32.53342453685406]
In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings.
We present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique can retrieve bilingual lexicons.
Across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words.
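The constrained nearest-neighbor sampling described above can be sketched in a few lines: for a query word, restrict the candidate set to the target language's vocabulary and pick the highest-cosine neighbor in the shared polyglot embedding space. The toy vectors and language tags below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def constrained_nearest_neighbor(word, embeddings, languages, target_lang):
    """Nearest neighbor of `word`, restricted to the target-language vocabulary."""
    query = embeddings[word]
    candidates = [w for w, lang in languages.items()
                  if lang == target_lang and w != word]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# Toy polyglot embedding space (vectors are invented for illustration).
embeddings = {
    "dog":   [0.90, 0.10, 0.00],
    "cat":   [0.80, 0.30, 0.10],
    "hund":  [0.88, 0.12, 0.02],  # German "dog"
    "katze": [0.78, 0.31, 0.12],  # German "cat"
}
languages = {"dog": "en", "cat": "en", "hund": "de", "katze": "de"}
print(constrained_nearest_neighbor("dog", embeddings, languages, "de"))  # -> "hund"
```

Iterating this retrieval over the full source vocabulary produces candidate bilingual lexicon entries, which is the core of the technique the summary describes.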
arXiv Detail & Related papers (2020-08-31T03:57:50Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
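Benchmarks like Multi-SimLex are typically scored by correlating model similarity scores with human ratings via Spearman's rank correlation. The snippet below sketches that evaluation with a small pure-Python implementation; the gold ratings and model scores are invented for illustration.

```python
def rank(values):
    """Ranks (1-based), with ties given the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical gold similarity ratings vs. model scores for five concept pairs.
gold  = [9.1, 7.4, 3.2, 1.0, 5.5]
model = [0.95, 0.80, 0.40, 0.05, 0.60]
print(round(spearman(gold, model), 3))  # perfectly monotone -> 1.0
```

Because only the orderings matter, Spearman's rho rewards a model that ranks concept pairs like the annotators do, regardless of the scale of its raw scores.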
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.