Evaluating Input Representation for Language Identification in
Hindi-English Code Mixed Text
- URL: http://arxiv.org/abs/2011.11263v2
- Date: Wed, 25 Nov 2020 13:22:04 GMT
- Title: Evaluating Input Representation for Language Identification in
Hindi-English Code Mixed Text
- Authors: Ramchandra Joshi, Raviraj Joshi
- Abstract summary: Code-mixed text comprises text written in more than one language.
People naturally tend to combine local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
- Score: 4.4904382374090765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing (NLP) techniques have become mainstream over the
past decade. Most of these advances are attributed to the processing of a
single language. More recently, with the extensive growth of social media
platforms, the focus has shifted to code-mixed text. Code-mixed text comprises
text written in more than one language. People naturally tend to combine a local
language with global languages like English. Current NLP techniques are not
sufficient to process such texts. As a first step, the text is processed to
identify the language of the words in the text. In this work, we focus on
language identification in code-mixed sentences for Hindi-English mixed text.
The task of language identification is formulated as a token classification
task. In the supervised setting, each word in the sentence has an associated
language label. We evaluate different deep learning models and input
representation combinations for this task. Mainly, character, sub-word, and
word embeddings are considered in combination with CNN and LSTM based models.
We show that sub-word representation along with the LSTM model gives the best
results. In general sub-word representations perform significantly better than
other input representations. We report the best accuracy of 94.52% using a
single layer LSTM model on the standard SAIL ICON 2017 test set.
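As a concrete illustration of the best configuration, here is a minimal sketch pairing sub-word embeddings with a single-layer LSTM token classifier in PyTorch. The vocabulary size, embedding and hidden dimensions, and the three-way label set are illustrative assumptions, not the paper's published code.

```python
# Minimal sketch: sub-word embeddings feeding a single-layer (bidirectional)
# LSTM that emits one language label per sub-word token.
# All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class SubwordLSTMTagger(nn.Module):
    def __init__(self, vocab_size=8000, emb_dim=128, hidden_dim=256, num_labels=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Single LSTM layer over the sub-word sequence.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        # One label per sub-word token, e.g. {HI, EN, OTHER}.
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, subword_ids):            # (batch, seq_len)
        x = self.embedding(subword_ids)        # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                    # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(h)              # (batch, seq_len, num_labels)

model = SubwordLSTMTagger()
logits = model(torch.randint(1, 8000, (2, 20)))  # toy batch of sub-word ids
print(logits.shape)                              # torch.Size([2, 20, 3])
```

Word-level labels can then be read off the sub-word predictions, for example by taking each word's first sub-word; that alignment convention is an assumption here, not necessarily the paper's exact scheme.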
Related papers
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization allows words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language.
arXiv Detail & Related papers (2024-03-26T17:26:50Z)
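Swahili syllables are predominantly open (they typically end in a vowel), with syllabic nasals as in "mtoto". The sketch below approximates syllable tokenization with a single regular expression; it is a naive illustration for intuition, not the tokenizer proposed in the paper, and it does not handle prenasalized clusters such as "nd" or "mb".

```python
# Naive syllable tokenizer for Swahili-like text: match a syllabic nasal
# (m/n not followed by a vowel) or a consonant cluster ending in a vowel.
import re

SYLLABLE = re.compile(r"[mn](?![aeiou])|[^aeiou\W]*[aeiou]", re.IGNORECASE)

def syllable_tokenize(word: str) -> list[str]:
    return SYLLABLE.findall(word)

print(syllable_tokenize("kitabu"))  # ['ki', 'ta', 'bu']
print(syllable_tokenize("mtoto"))   # ['m', 'to', 'to']
```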
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
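A minimal way to build such a word-level tagger is to fine-tune a pretrained Transformer with a token classification head. The multilingual checkpoint, the toy sentence, and the six labels below (borrowed from the CoLI-Kenglish entry that follows) are assumptions for illustration, not necessarily the authors' exact setup.

```python
# Sketch: Transformer token classifier for word-level language
# identification in code-mixed text. Checkpoint and labels are
# illustrative assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["Kannada", "English", "Mixed-language", "Name", "Location", "Other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

# Word-level labels are typically aligned to the first sub-token of each word.
enc = tokenizer("idu nanna favourite channel", return_tensors="pt")
logits = model(**enc).logits  # (1, num_subtokens, len(LABELS))
```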
- CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts [0.0]
Many Indians, especially youths, are comfortable with Hindi and English in addition to their local languages, and hence often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct the CoLI-Kenglish dataset and code-mixed Kn-En embeddings.
The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location" and "Other".
arXiv Detail & Related papers (2022-11-17T19:16:56Z)
- Language Identification of Hindi-English tweets using code-mixed BERT [0.0]
The work utilizes a collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification.
The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
arXiv Detail & Related papers (2021-07-02T17:51:36Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP significantly boosts model performance on a wide range of multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
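In essence, XLP swaps the additive language embedding for a per-language linear map applied to the word embeddings before the Transformer. A minimal sketch of that idea, with illustrative dimensions and a hypothetical two-language setup:

```python
# Sketch of a language-specific projection: each language gets its own
# linear map over word embeddings; the projected embeddings would then
# be fed to a Transformer encoder. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageProjection(nn.Module):
    def __init__(self, num_languages=2, dim=512):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_languages)])

    def forward(self, word_embeddings, lang_id):
        # word_embeddings: (batch, seq_len, dim); lang_id picks the map.
        return self.projections[lang_id](word_embeddings)

proj = LanguageProjection()
emb = torch.randn(2, 10, 512)
projected = proj(emb, 0)   # would be fed into the Transformer next
print(projected.shape)     # torch.Size([2, 10, 512])
```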
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
The method further adds a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
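The self-teaching component can be sketched as a KL-divergence loss that pulls the model's predictions on target-language text toward soft pseudo-labels generated for its translation. The shapes and temperature below are illustrative; this is not FILTER's published implementation.

```python
# Sketch of a KL-divergence self-teaching loss: soft pseudo-labels from
# the translated input supervise predictions on the target-language input.
import torch
import torch.nn.functional as F

def self_teaching_kl(student_logits, teacher_logits, temperature=1.0):
    # Soft pseudo-labels; no gradient flows through the teacher side.
    soft_labels = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")

student = torch.randn(4, 3, requires_grad=True)  # target-language predictions
teacher = torch.randn(4, 3)                      # predictions on the translation
loss = self_teaching_kl(student, teacher)
loss.backward()
```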
- IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection [1.2301855531996841]
Code-mixing adds to the challenge of analyzing the sentiment of the text due to the non-standard writing style.
We present a candidate sentence generation and selection approach on top of a Bi-LSTM based neural classifier.
The proposed approach improves system performance compared to the Bi-LSTM based neural classifier alone.
arXiv Detail & Related papers (2020-06-25T14:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.