Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
- URL: http://arxiv.org/abs/2412.10805v1
- Date: Sat, 14 Dec 2024 12:10:38 GMT
- Title: Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
- Authors: Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya
- Abstract summary: We study whether pre-trained language models (PLMs) are agnostic to linguistically grounded attacks.
Our findings reveal that PLMs are susceptible to linguistic perturbations, though slightly less so than to non-linguistic attacks.
- Abstract: Pre-trained language models (PLMs) are known to be susceptible to perturbations of the input text, but existing work does not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in natural text. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks. To this end, we offer the first study addressing this question, investigating several Indic languages and various downstream tasks. Our findings reveal that PLMs are susceptible to linguistic perturbations, though slightly less so than to non-linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.
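The contrast between non-linguistic and linguistically grounded perturbations can be sketched minimally. The rules below are hypothetical stand-ins for illustration only, not the paper's actual attack set: the non-linguistic attack injects arbitrary character noise, while the "linguistic" one applies a structured, phonologically plausible substitution of a Devanagari vowel sign (matra).

```python
import random

def nonlinguistic_perturb(text: str, rng: random.Random) -> str:
    """Non-linguistic attack: swap two adjacent characters at random,
    ignoring any grammatical or orthographic structure."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

# Hypothetical linguistically grounded rule: replace one Devanagari
# vowel sign with a phonologically close one, mimicking a common
# spelling variation rather than arbitrary noise.
MATRA_SWAPS = {"\u093f": "\u0940", "\u0940": "\u093f"}  # short i <-> long ii

def linguistic_perturb(text: str) -> str:
    """Apply the first matching matra-substitution rule, once."""
    for src, tgt in MATRA_SWAPS.items():
        if src in text:
            return text.replace(src, tgt, 1)
    return text  # no applicable rule: leave the input unchanged

rng = random.Random(0)
print(nonlinguistic_perturb("language", rng))
print(linguistic_perturb("\u0939\u093f\u0928\u094d\u0926\u0940"))  # हिन्दी -> हीन्दी
```

The linguistic variant stays within the script's orthographic conventions, which is what makes such attacks subtle: the perturbed string can still look like a plausible spelling to a reader.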
Related papers
- Can Language Models Learn Typologically Implausible Languages?
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans.
We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals.
We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z) - Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis [5.029635172046762]
Language Confusion is a phenomenon where Large Language Models (LLMs) generate text that is neither in the desired language, nor in a contextually appropriate language.
We introduce a novel metric, Language Confusion Entropy, designed to measure and quantify this confusion.
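The paper's exact formulation is not reproduced here, but an entropy over per-token language-identification labels is one plausible reading of such a metric. The sketch below assumes token-level language tags as input; it is illustrative, not the published definition:

```python
import math
from collections import Counter

def language_confusion_entropy(token_langs: list[str]) -> float:
    """Shannon entropy (in bits) of the language distribution over tokens.

    0.0 means every token was identified as the same language; higher
    values indicate the output mixes languages. Note: an illustrative
    sketch of the idea, not the metric's published definition.
    """
    counts = Counter(token_langs)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(language_confusion_entropy(["en", "en", "en", "en"]))  # 0.0
print(language_confusion_entropy(["en", "hi", "en", "hi"]))  # 1.0
```

Under this reading, a monolingual response scores 0 bits, while an even two-language mix scores 1 bit, so the metric grows with the degree of mixing.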
arXiv Detail & Related papers (2024-10-17T05:43:30Z) - The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification [57.06913662622832]
Gender-fair language fosters inclusion by addressing all genders or using neutral forms.
Gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns.
While we offer initial insights into its effect on German text classification, the findings likely apply to other languages.
arXiv Detail & Related papers (2024-09-26T15:08:17Z) - Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks [3.2297018268473665]
Large Language Models (LLMs) are susceptible to malicious influence by cyber attackers through intrusions such as adversarial, backdoor, and embedding inversion attacks.
This study explores the security of multilingual LLMs in the context of embedding inversion attacks and investigates cross-lingual and cross-script inversion across 20 languages.
Our findings indicate that languages written in Arabic script and Cyrillic script are particularly vulnerable to embedding inversion, as are languages within the Indo-Aryan language family.
arXiv Detail & Related papers (2024-08-21T16:16:34Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z) - Text Embedding Inversion Security for Multilingual Language Models [2.790855523145802]
Research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model.
This study is the first to investigate multilingual inversion attacks, shedding light on the differences in attacks and defenses across monolingual and multilingual settings.
arXiv Detail & Related papers (2024-01-22T18:34:42Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages
We demonstrate a large-scale analysis of multilingual abusive speech in Indic languages.
We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection.
arXiv Detail & Related papers (2022-04-26T18:56:01Z) - On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.