Elevating Code-mixed Text Handling through Auditory Information of Words
- URL: http://arxiv.org/abs/2310.18155v1
- Date: Fri, 27 Oct 2023 14:03:30 GMT
- Title: Elevating Code-mixed Text Handling through Auditory Information of Words
- Authors: Mamta, Zishan Ahmad and Asif Ekbal
- Abstract summary: We propose an effective approach for creating language models for handling code-mixed textual data using auditory information of words from SOUNDEX.
Our approach includes a pre-training step based on masked language modelling that incorporates SOUNDEX representations (SAMLM), and a new method of providing input data to the pre-trained model.
- Score: 24.53638976212391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing popularity of code-mixed data, there is an increasing need
for better handling of this type of data, which poses a number of challenges,
such as dealing with spelling variations, multiple languages, different
scripts, and a lack of resources. Current language models face difficulty in
effectively handling code-mixed data as they primarily focus on the semantic
representation of words and ignore the auditory phonetic features. This leads
to difficulties in handling spelling variations in code-mixed text. In this
paper, we propose an effective approach for creating language models for
handling code-mixed textual data using auditory information of words from
SOUNDEX. Our approach includes a pre-training step based on masked language
modelling that incorporates SOUNDEX representations (SAMLM), and a new method
of providing input data to the pre-trained model. Through
experimentation on various code-mixed datasets (of different languages) for
sentiment, offensive and aggression classification tasks, we establish that our
novel language modeling approach (SAMLM) results in improved robustness towards
adversarial attacks on code-mixed classification tasks. Additionally, our SAMLM
based approach also results in better classification results over the popular
baselines for code-mixed tasks. We use the explainability technique SHAP
(SHapley Additive exPlanations) to show how the auditory features
incorporated through SAMLM help the model handle code-mixed text
effectively and increase robustness against adversarial attacks. Source
code is available at https://github.com/20118/DefenseWithPhonetics and
https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Phonetics.
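As context for the approach above, the sketch below implements the classic SOUNDEX phonetic code that SAMLM draws on: spelling variants that sound alike collapse to the same four-character code. This is a minimal, simplified rendition (the standard h/w separator rule is omitted); how SAMLM actually fuses these codes with token input during masked-language-model pre-training is specified in the paper and the linked repository, not here.

```python
# Minimal SOUNDEX encoder (simplified: the classic h/w separator rule is
# omitted). Spelling variants of the same spoken word map to one code,
# which is the phonetic signal SAMLM exploits for code-mixed text.

SOUNDEX_MAP = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(word: str) -> str:
    """Return the 4-character SOUNDEX code for a romanized word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return "0000"
    first = word[0].upper()
    # Map letters to digits; vowels and h/w/y contribute nothing.
    digits = [SOUNDEX_MAP.get(ch, "") for ch in word]
    # Collapse adjacent identical digits; the empty entries keep
    # repeated digits separated by vowels apart.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = "".join(collapsed)
    # The first letter keeps its place; drop its own digit if present.
    if code and word[0] in SOUNDEX_MAP and code[0] == SOUNDEX_MAP[word[0]]:
        code = code[1:]
    return (first + code)[:4].ljust(4, "0")

# Romanized Hindi spelling variants collapse to the same code:
print(soundex("mujhe"), soundex("mujhee"), soundex("mujhy"))  # M200 M200 M200
```

In the paper's setup these codes enter pre-training alongside regular tokens; treating them as extra input features is one natural way to use the function above, but the exact input format is defined by the released code.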
Related papers
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose TransMI, a framework that can create a strong baseline well-suited for data transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
arXiv Detail & Related papers (2024-05-16T09:08:09Z) - From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements of the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
Cline is the largest dataset of its kind, with 16,642 sentences drawn from two sources.
Experiments on Cline show that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
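The code-mixing metrics referenced above are not enumerated in this summary; the Code-Mixing Index (CMI) of Gambäck and Das is the most common such metric, so here is a minimal sketch under that assumption (function and label names are illustrative):

```python
from collections import Counter

def cmi(lang_tags: list[str], neutral: str = "univ") -> float:
    """Code-Mixing Index for one utterance (Gambäck & Das style).

    lang_tags: per-token language labels, e.g. ["hi", "en", "hi", "univ"];
    neutral:   the label for language-independent tokens (names, punctuation).
    """
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == neutral)
    if n == u:  # nothing language-tagged: no mixing by definition
        return 0.0
    dominant = max(Counter(t for t in lang_tags if t != neutral).values())
    return 100.0 * (1.0 - dominant / (n - u))

print(cmi(["en", "en", "en"]))        # 0.0  (monolingual)
print(cmi(["hi", "en", "hi", "en"]))  # 50.0 (evenly mixed, two languages)
```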
arXiv Detail & Related papers (2024-05-09T06:40:39Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC, and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as input the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z) - Leveraging Language Identification to Enhance Code-Mixed Text
Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z) - SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup
Training [15.877178854064708]
SelfMix is a simple yet effective method to handle label noise in text classification tasks.
Our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training.
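As a rough illustration of the dropout mechanism mentioned above (a sketch of the general idea, not SelfMix's full pipeline of noise separation and mixup training; names are illustrative): keeping dropout active yields a different sub-network on each forward pass, and averaging several passes gives less overconfident pseudo-labels, which curbs confirmation bias in self-training.

```python
import torch

def dropout_pseudo_label(model: torch.nn.Module,
                         x: torch.Tensor,
                         passes: int = 2) -> torch.Tensor:
    """Average several dropout-perturbed predictions from one model.

    Each pass with dropout enabled samples a different sub-network, so the
    averaged distribution is less likely to simply confirm the model's own
    mistakes than a single deterministic prediction.
    """
    model.train()  # keep dropout active during pseudo-labelling
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=-1) for _ in range(passes)]
    return torch.stack(probs).mean(dim=0)
```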
arXiv Detail & Related papers (2022-10-10T09:46:40Z) - DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
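A hedged sketch of that two-step interpolation (the weights, distributions, and names here are illustrative; DoubleMix's exact augmentation operations and weighting constraints are in the paper):

```python
import torch

def two_step_mix(h_orig: torch.Tensor,
                 h_perturbed: list[torch.Tensor],
                 alpha: float = 1.0) -> torch.Tensor:
    """Two-step interpolation in hidden space (illustrative sketch).

    Step 1: combine the hidden states of the perturbed samples.
    Step 2: interpolate the result toward the original hidden state,
    keeping the original dominant so the label stays valid.
    """
    # Step 1: convex combination of the perturbed hidden states.
    w = torch.distributions.Dirichlet(
        torch.full((len(h_perturbed),), alpha)).sample()
    h_synth = sum(wi * h for wi, h in zip(w, h_perturbed))
    # Step 2: mix toward the original; clamp so the original dominates.
    lam = torch.distributions.Beta(alpha, alpha).sample().clamp(min=0.5)
    return lam * h_orig + (1.0 - lam) * h_synth
```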
arXiv Detail & Related papers (2022-09-12T15:01:04Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - COCO-LM: Correcting and Contrasting Text Sequences for Language Model
Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
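A compressed sketch of the correcting idea described above (tensor shapes and names are assumptions for illustration; COCO-LM's actual architecture and losses are in the paper): a small auxiliary masked LM fills masked slots with plausible-but-possibly-wrong tokens, and the main model is trained to recover the original token at every position.

```python
import torch
import torch.nn.functional as F

def build_corrupted(original_ids: torch.Tensor,   # (seq_len,) token ids
                    aux_logits: torch.Tensor,     # (seq_len, vocab) aux MLM output
                    masked_pos: torch.Tensor) -> torch.Tensor:  # (k,) indices
    """Fill masked positions with samples from the auxiliary masked LM.

    The samples are plausible but often wrong, producing the 'challenging
    errors' the main model must learn to correct.
    """
    corrupted = original_ids.clone()
    probs = F.softmax(aux_logits[masked_pos], dim=-1)
    corrupted[masked_pos] = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return corrupted

def correction_loss(main_logits: torch.Tensor,    # (seq_len, vocab)
                    original_ids: torch.Tensor) -> torch.Tensor:
    """All-token objective: predict the original token at every position."""
    return F.cross_entropy(main_logits, original_ids)
```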
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.