Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit
- URL: http://arxiv.org/abs/2406.08633v1
- Date: Wed, 12 Jun 2024 20:30:34 GMT
- Title: Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit
- Authors: Fedor Vitiugin, Sunok Lee, Henna Paakki, Anastasiia Chizhikova, Nitin Sawhney,
- Abstract summary: This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit.
We present Ensemble Learning for Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions.
- Score: 4.019533549688538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The surge in global migration patterns underscores the imperative of integrating migrants seamlessly into host communities, necessitating inclusive and trustworthy public services. Despite the Nordic countries' robust public sector infrastructure, recent immigrants often encounter barriers to accessing these services, exacerbating social disparities and eroding trust. Addressing digital inequalities and linguistic diversity is paramount in this endeavor. This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions. Leveraging ensemble learning techniques for combining multiple tokenizers' outputs and pre-trained language models, ELMICT demonstrates high performance (with F1 more than 0.95) in identifying code-mixing across various languages and contexts, particularly in cross-lingual zero-shot conditions (with avg. F1 more than 0.70). Moreover, the utilization of ELMICT helps to analyze the prevalence of code-mixing in migration-related threads compared to other thematic categories on Reddit, shedding light on the topics of concern to migrant communities. Our findings reveal insights into the communicative strategies employed by migrants on social media platforms, offering implications for the development of inclusive digital public services and conversational systems. By addressing the research questions posed in this study, we contribute to the understanding of linguistic diversity in migration discourse and pave the way for more effective tools for building trust in multicultural societies.
Related papers
- RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval [0.0]
In India, social media users frequently engage in code-mixed conversations using the Roman script.
This paper focuses on the challenges of extracting relevant information from code-mixed conversations.
We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
arXiv Detail & Related papers (2024-11-07T14:41:01Z) - Language Model Alignment in Multilingual Trolley Problems [138.5684081822807]
Building on the Moral Machine experiment, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP.
Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions.
We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems.
arXiv Detail & Related papers (2024-07-02T14:02:53Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing.
Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z) - Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z) - A Comprehensive Understanding of Code-mixed Language Semantics using
Hierarchical Transformer [28.3684494647968]
We propose a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages.
We evaluate the proposed method across 6 Indian languages and 9 NLP tasks on 17 datasets.
arXiv Detail & Related papers (2022-04-27T07:50:18Z) - Challenges and Considerations with Code-Mixed NLP for Multilingual
Societies [1.6675267471157407]
This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good.
We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good.
arXiv Detail & Related papers (2021-06-15T00:53:55Z) - X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural
Language Understanding and Question Answering [55.57776147848929]
We propose X-METRA-ADA, a cross-lingual MEta-TRAnsfer learning ADAptation approach for Natural Language Understanding (NLU)
Our approach adapts MAML, an optimization-based meta-learning approach, to learn to adapt to new languages.
We show that our approach outperforms naive fine-tuning, reaching competitive performance on both tasks for most languages.
arXiv Detail & Related papers (2021-04-20T00:13:35Z) - Characterizing English Variation across Social Media Communities with
BERT [9.98785450861229]
We analyze two months of English comments in 474 Reddit communities.
The specificity of different sense clusters to a community, combined with the specificity of a community's unique word types, is used to identify cases where a social group's language deviates from the norm.
We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
arXiv Detail & Related papers (2021-02-12T23:50:57Z) - Migratable AI: Personalizing Dialog Conversations with migration context [25.029958885340058]
We collected a dataset from the dialog conversations between crowdsourced workers with the migration context.
We trained the generative and information retrieval models on the dataset using with and without migration context.
We believe that the migration dataset would be useful for training future migratable AI systems.
arXiv Detail & Related papers (2020-10-22T22:23:03Z) - On Negative Interference in Multilingual Models: Findings and A
Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.