IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments
- URL: http://arxiv.org/abs/2601.06477v2
- Date: Tue, 13 Jan 2026 06:53:27 GMT
- Title: IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments
- Authors: Debasmita Panda, Akash Anil, Neelesh Kumar Shukla,
- Abstract summary: This paper focuses on creating a dataset IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms. We carefully selected 25,000 comments appearing on various threads in Reddit and videos on YouTube discussing trending topics on regional issues in India. To detect the presence of regional bias and its severity in IndRegBias, we evaluate open-source Large Language Models (LLMs) and Indic Language Models (ILMs) using zero-shot, few-shot, and fine-tuning strategies.
- Score: 0.1749935196721634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Warning: This paper consists of examples representing regional biases in Indian regions that might be offensive towards a particular region. While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing on various threads in Reddit and videos on YouTube discussing trending topics on regional issues in India. Furthermore, we propose a multilevel annotation strategy to annotate the comments describing the severity of regional biased statements. To detect the presence of regional bias and its severity in IndRegBias, we evaluate open-source Large Language Models (LLMs) and Indic Language Models (ILMs) using zero-shot, few-shot, and fine-tuning strategies. We observe that zero-shot and few-shot approaches show lower accuracy in detecting regional biases and severity in the majority of the LLMs and ILMs. However, the fine-tuning approach significantly enhances the performance of the LLM in detecting Indian regional bias along with its severity.
Related papers
- IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context [10.90604216960609]
Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities.
We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity.
We also introduce a novel dataset - IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes) comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status.
arXiv Detail & Related papers (2025-10-03T06:03:26Z)
- PakBBQ: A Culturally Adapted Bias Benchmark for QA [3.4455728937232597]
We introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering dataset.
PakBBQ comprises over 214 templates, 17180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religious, regional affiliation, and language formality that are relevant in Pakistan.
arXiv Detail & Related papers (2025-08-13T20:42:44Z)
- Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models [52.00270888041742]
We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries.
Our findings show significant geopolitical biases, with models favoring specific national narratives.
Simple debiasing prompts had a limited effect on reducing these biases.
arXiv Detail & Related papers (2025-06-07T10:45:17Z)
- What's Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs [8.219247185418821]
Large Language Models (LLMs) often exhibit social biases inherited from their training data.
We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level.
We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response at the term level, they continue to reinforce biases in nuanced settings.
arXiv Detail & Related papers (2025-02-27T04:25:54Z)
- Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias [2.98683507969764]
It is important to assess the influence of different types of biases embedded in Large Language Models to ensure fair use in sensitive fields.
Although there have been extensive works on bias assessment in English, such efforts are rare and scarce for a major language like Bangla.
This is the first work of such kind involving bias assessment of LLMs for Bangla to the best of our knowledge.
arXiv Detail & Related papers (2024-07-03T22:45:36Z)
- White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content.
We introduce the Language Agency Bias Evaluation benchmark, which comprehensively evaluates biases in LLMs.
Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
- Large Language Models are Geographically Biased [47.88767211956144]
We study what Large Language Models (LLMs) know about the world we live in through the lens of geography.
We show various problematic geographic biases, which we define as systemic errors in geospatial predictions.
arXiv Detail & Related papers (2024-02-05T02:32:09Z)
- GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Indian-BhED: A Dataset for Measuring India-Centric Biases in Large Language Models [18.201326983938014]
Large Language Models (LLMs) can encode societal biases, exposing their users to representational harms.
We quantify stereotypical bias in popular LLMs according to an Indian-centric frame through Indian-BhED, a first of its kind dataset.
We find that the majority of LLMs tested have a strong propensity to output stereotypes in the Indian context.
arXiv Detail & Related papers (2023-09-15T17:38:41Z)
- HERB: Measuring Hierarchical Regional Bias in Pre-trained Language Models [33.0987914452712]
Regional bias in language models (LMs) is a long-standing global discrimination problem.
This paper bridges the gap by analysing the regional bias learned by the pre-trained language models.
We propose a HiErarchical Regional Bias evaluation method (HERB) to quantify the bias in pre-trained LMs.
arXiv Detail & Related papers (2022-11-05T11:30:57Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Towards Controllable Biases in Language Generation [87.89632038677912]
We develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups.
We analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics.
arXiv Detail & Related papers (2020-05-01T08:25:11Z)