Language Lexicons for Hindi-English Multilingual Text Processing
- URL: http://arxiv.org/abs/2106.15105v1
- Date: Tue, 29 Jun 2021 05:42:54 GMT
- Title: Language Lexicons for Hindi-English Multilingual Text Processing
- Authors: Mohd Zeeshan Ansari, Tanvir Ahmad and Noaima Bari
- Abstract summary: The present Language Identification techniques presume that a document contains text in one of a fixed set of languages.
Due to the unavailability of large standard corpora for Hindi-English mixed-lingual language processing tasks, we propose language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language Identification in textual documents is the process of automatically
detecting the language contained in a document based on its content. Present
Language Identification techniques presume that a document contains text in one
of a fixed set of languages; however, this presumption does not hold for
multilingual documents, which include content in more than one language. Due to
the unavailability of large standard corpora for Hindi-English mixed-lingual
language processing tasks, we propose language lexicons, a novel kind of lexical
database that supports several multilingual language processing tasks. These
lexicons are built by learning classifiers over transliterated Hindi and English
vocabulary. The designed lexicons possess richer quantitative characteristics
than their primary source of collection, as revealed using visualization
techniques.
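The abstract's core idea, learning classifiers over transliterated Hindi and English vocabulary to decide a word's language, can be sketched as follows. This is a minimal illustration only: the word lists, features (character bigrams), and classifier (add-one-smoothed Naive Bayes) are assumptions for demonstration, not the paper's actual setup.

```python
from collections import Counter
import math

def char_ngrams(word, n=2):
    """Character bigrams with boundary markers, e.g. 'din' -> ['^d','di','in','n$']."""
    w = f"^{word}$"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def train(lexicon):
    """lexicon: dict mapping label -> list of words. Returns per-label n-gram counts."""
    model = {label: Counter() for label in lexicon}
    for label, words in lexicon.items():
        for word in words:
            model[label].update(char_ngrams(word))
    return model

def classify(model, word):
    """Score the word under each label with add-one-smoothed log-probabilities."""
    best_label, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        vocab = len(counts) + 1
        score = sum(math.log((counts[g] + 1) / (total + vocab))
                    for g in char_ngrams(word))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy transliterated-Hindi vs. English word lists (illustrative only).
lexicon = {
    "hi": ["dhanyavad", "namaste", "pyaar", "dost", "khushi", "zindagi"],
    "en": ["thanks", "hello", "love", "friend", "happy", "life"],
}
model = train(lexicon)
print(classify(model, "pyaara"))  # likely "hi"
```

A word-level classifier of this shape, applied over a large transliterated vocabulary, yields exactly the kind of scored lexicon the abstract describes; a real system would use richer features and far more data.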
Related papers
- Fine-Tuned Self-Supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech [4.39549503760707]
We develop a continuous multilingual language diarizer using fine-tuned speech representations extracted from a large self-supervised architecture (WavLM)
We experiment with a code-switched corpus consisting of five South African languages (isiZulu, isiXhosa, Setswana, Sesotho and English)
arXiv Detail & Related papers (2023-12-15T09:40:41Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation [0.40611352512781856]
Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language.
We propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings.
Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text.
arXiv Detail & Related papers (2021-05-07T17:48:53Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
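The KL-divergence self-teaching loss mentioned in this summary can be illustrated with a small sketch: an average KL between fixed soft pseudo-labels (teacher) and model predictions (student) over a batch. The distributions and function names below are hypothetical, for illustration only, and do not reproduce FILTER's actual implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_teaching_loss(teacher_probs, student_probs):
    """Average KL between soft pseudo-labels (teacher, treated as fixed) and
    student predictions over a batch of examples."""
    batch = len(teacher_probs)
    return sum(kl_divergence(t, s)
               for t, s in zip(teacher_probs, student_probs)) / batch

# Hypothetical 3-class soft pseudo-labels for translated target-language text.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(round(self_teaching_loss(teacher, student), 4))
```

In training, this term would be added to the supervised loss so the student's predictions on translated text are pulled toward the auto-generated soft labels.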
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Discovering Bilingual Lexicons in Polyglot Word Embeddings [32.53342453685406]
In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings.
We present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique can retrieve bilingual lexicons.
Across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words.
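The constrained nearest-neighbor sampling described above can be sketched in a few lines: for a query word, restrict the candidate set to the target language's vocabulary and pick the highest-cosine neighbor in the shared polyglot embedding space. The toy vectors and language tags below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def constrained_nearest_neighbor(word, embeddings, languages, target_lang):
    """Nearest neighbor of `word`, restricted to the target-language vocabulary."""
    query = embeddings[word]
    candidates = [w for w, lang in languages.items()
                  if lang == target_lang and w != word]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# Toy polyglot embedding space (vectors are invented for illustration).
embeddings = {
    "dog":   [0.90, 0.10, 0.00],
    "cat":   [0.80, 0.30, 0.10],
    "hund":  [0.88, 0.12, 0.02],  # German "dog"
    "katze": [0.78, 0.31, 0.12],  # German "cat"
}
languages = {"dog": "en", "cat": "en", "hund": "de", "katze": "de"}
print(constrained_nearest_neighbor("dog", embeddings, languages, "de"))  # -> "hund"
```

Iterating this retrieval over the full source vocabulary produces candidate bilingual lexicon entries, which is the core of the technique the summary describes.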
arXiv Detail & Related papers (2020-08-31T03:57:50Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
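Benchmarks like Multi-SimLex are typically scored by correlating model similarity scores with human ratings via Spearman's rank correlation. The snippet below sketches that evaluation with a small pure-Python implementation; the gold ratings and model scores are invented for illustration.

```python
def rank(values):
    """Ranks (1-based), with ties given the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical gold similarity ratings vs. model scores for five concept pairs.
gold  = [9.1, 7.4, 3.2, 1.0, 5.5]
model = [0.95, 0.80, 0.40, 0.05, 0.60]
print(round(spearman(gold, model), 3))  # perfectly monotone -> 1.0
```

Because only the orderings matter, Spearman's rho rewards a model that ranks concept pairs like the annotators do, regardless of the scale of its raw scores.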
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.