A Simple and Efficient Probabilistic Language model for Code-Mixed Text
- URL: http://arxiv.org/abs/2106.15102v1
- Date: Tue, 29 Jun 2021 05:37:57 GMT
- Title: A Simple and Efficient Probabilistic Language model for Code-Mixed Text
- Authors: M Zeeshan Ansari, Tanvir Ahmad, M M Sufyan Beg, Asma Ikram
- Abstract summary: We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Conventional natural language processing approaches are not well suited to
social media text owing to its colloquial discourse and non-homogeneous
characteristics. Notably, language identification in a multilingual document is a
prerequisite subtask for several information extraction applications such as
information retrieval, named entity recognition, and relation extraction. The problem
is often more challenging in code-mixed documents, in which words from a foreign
language are drawn into the base language while framing the text. Word embeddings are
powerful language modeling tools for representing text documents and for computing
similarity between words or documents. We present a simple probabilistic approach for
building efficient word embeddings for code-mixed text and demonstrate it on language
identification of Hindi-English short text messages scraped from Twitter. We examine
its efficacy on the classification task using bidirectional LSTMs and SVMs and observe
improved scores over various existing code-mixed embeddings.
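A minimal sketch of the kind of pipeline the abstract describes, assuming the word representation is built from per-language character n-gram probabilities and fed to an SVM for word-level language identification; the n-gram order, smoothing, and toy vocabularies are illustrative assumptions, not the paper's exact embedding construction.

```python
# Sketch (not the authors' exact method): score each word under two character
# n-gram models (Roman-transliterated Hindi vs. English) and use the resulting
# probabilistic features to train an SVM for word-level language identification.
from collections import Counter
import numpy as np
from sklearn.svm import LinearSVC

def char_ngrams(word, n=2):
    word = f"#{word.lower()}#"            # pad to mark word boundaries
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_model(vocab, n=2):
    counts = Counter(g for w in vocab for g in char_ngrams(w, n))
    return counts, sum(counts.values())

def word_vector(word, models, n=2, alpha=1.0):
    # One length-normalised log-probability feature per language model
    # (add-alpha smoothing); this feature design is an assumption.
    feats = []
    for counts, total in models:
        logp = sum(np.log((counts[g] + alpha) / (total + alpha * len(counts) + 1))
                   for g in char_ngrams(word, n))
        feats.append(logp / max(len(word), 1))
    return np.array(feats)

# Toy vocabularies; in practice these would come from transliterated Hindi and
# English word lists harvested from the code-mixed corpus.
hindi_vocab = ["bahut", "accha", "nahi", "kya", "hai", "mera"]
english_vocab = ["very", "good", "not", "what", "is", "my"]
models = [ngram_model(hindi_vocab), ngram_model(english_vocab)]

train_words = hindi_vocab + english_vocab
X = np.vstack([word_vector(w, models) for w in train_words])
y = [0] * len(hindi_vocab) + [1] * len(english_vocab)   # 0 = Hindi, 1 = English

clf = LinearSVC().fit(X, y)
print(clf.predict(np.vstack([word_vector(w, models) for w in ["achha", "good"]])))
```

The same feature vectors could equally be fed, per token, into a bidirectional LSTM tagger as in the paper's experiments; the SVM is shown here only because it needs no training loop.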
Related papers
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
arXiv Detail & Related papers (2023-04-28T12:11:21Z)
- Language Lexicons for Hindi-English Multilingual Text Processing [0.0]
Present language identification techniques presume that a document contains text in one of a fixed set of languages.
Due to the unavailability of large standard corpora for Hindi-English mixed-lingual language processing tasks, we propose language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
arXiv Detail & Related papers (2021-06-29T05:42:54Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Feature Selection on Noisy Twitter Short Text Messages for Language Identification [0.0]
We apply different feature selection algorithms across various learning algorithms in order to analyze the effect of each algorithm.
The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter.
arXiv Detail & Related papers (2020-07-11T09:22:01Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)