Related papers: Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit

URL: http://arxiv.org/abs/2406.08633v1
Date: Wed, 12 Jun 2024 20:30:34 GMT
Title: Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit
Authors: Fedor Vitiugin, Sunok Lee, Henna Paakki, Anastasiia Chizhikova, Nitin Sawhney
Abstract summary: This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions.
Score: 4.019533549688538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The surge in global migration patterns underscores the imperative of integrating migrants seamlessly into host communities, necessitating inclusive and trustworthy public services. Despite the Nordic countries' robust public sector infrastructure, recent immigrants often encounter barriers to accessing these services, exacerbating social disparities and eroding trust. Addressing digital inequalities and linguistic diversity is paramount in this endeavor. This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions. Leveraging ensemble learning techniques for combining multiple tokenizers' outputs and pre-trained language models, ELMICT demonstrates high performance (with F1 more than 0.95) in identifying code-mixing across various languages and contexts, particularly in cross-lingual zero-shot conditions (with avg. F1 more than 0.70). Moreover, the utilization of ELMICT helps to analyze the prevalence of code-mixing in migration-related threads compared to other thematic categories on Reddit, shedding light on the topics of concern to migrant communities. Our findings reveal insights into the communicative strategies employed by migrants on social media platforms, offering implications for the development of inclusive digital public services and conversational systems. By addressing the research questions posed in this study, we contribute to the understanding of linguistic diversity in migration discourse and pave the way for more effective tools for building trust in multicultural societies.

Related papers

When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
MASim: Multilingual Agent-Based Simulation for Social Science [68.04129327237963]
Multi-agent role-playing has recently shown promise for studying social behavior with language agents. Existing simulations are mostly monolingual and fail to model cross-lingual interaction. We introduce MASim, the first multilingual agent-based simulation framework.
arXiv Detail & Related papers (2025-12-08T06:12:48Z)
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus [11.518751071307745]
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse. There has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards.
arXiv Detail & Related papers (2025-05-31T01:09:04Z)
SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset [34.40254709148148]
Code-Switching (CS) is the alternating use of two or more languages within a conversation or utterance. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems. SwitchLingua is the first large-scale multilingual and multi-ethnic code-switching dataset.
arXiv Detail & Related papers (2025-05-30T05:54:46Z)
Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models [1.835004446596942]
We introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 Nepali-English code-mixed comments. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings.
arXiv Detail & Related papers (2025-04-23T11:29:10Z)
MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual setting, where multilingual safety-aligned data are often limited. We propose an approach to build a multilingual guardrail with reasoning.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval [0.0]
In India, social media users frequently engage in code-mixed conversations using the Roman script. This paper focuses on the challenges of extracting relevant information from code-mixed conversations. We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
arXiv Detail & Related papers (2024-11-07T14:41:01Z)
Language Model Alignment in Multilingual Trolley Problems [138.5684081822807]
Building on the Moral Machine experiment, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems.
arXiv Detail & Related papers (2024-07-02T14:02:53Z)
A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems. This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm. We provide insights into what makes multilingual sequential learning particularly challenging. The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z)
A Comprehensive Understanding of Code-mixed Language Semantics using Hierarchical Transformer [28.3684494647968]
We propose a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages. We evaluate the proposed method across 6 Indian languages and 9 NLP tasks on 17 datasets.
arXiv Detail & Related papers (2022-04-27T07:50:18Z)
Challenges and Considerations with Code-Mixed NLP for Multilingual Societies [1.6675267471157407]
This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good. We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good.
arXiv Detail & Related papers (2021-06-15T00:53:55Z)
X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural Language Understanding and Question Answering [55.57776147848929]
We propose X-METRA-ADA, a cross-lingual MEta-TRAnsfer learning ADAptation approach for Natural Language Understanding (NLU). Our approach adapts MAML, an optimization-based meta-learning approach, to learn to adapt to new languages. We show that our approach outperforms naive fine-tuning, reaching competitive performance on both tasks for most languages.
arXiv Detail & Related papers (2021-04-20T00:13:35Z)
Characterizing English Variation across Social Media Communities with BERT [9.98785450861229]
We analyze two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community's unique word types, is used to identify cases where a social group's language deviates from the norm. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
arXiv Detail & Related papers (2021-02-12T23:50:57Z)
Migratable AI: Personalizing Dialog Conversations with migration context [25.029958885340058]
We collected a dataset from the dialog conversations between crowdsourced workers with the migration context. We trained the generative and information retrieval models on the dataset both with and without migration context. We believe that the migration dataset would be useful for training future migratable AI systems.
arXiv Detail & Related papers (2020-10-22T22:23:03Z)
On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages. We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.