Related papers: Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

URL: http://arxiv.org/abs/2506.00332v2
Date: Mon, 16 Jun 2025 01:12:52 GMT
Title: Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus
Authors: Svetlana Churina, Akshat Gupta, Insyirah Mujtahid, Kokil Jaidka,
Abstract summary: Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse. There has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards.
Score: 11.518751071307745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.

Related papers

RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval [0.0]
In India, social media users frequently engage in code-mixed conversations using the Roman script. This paper focuses on the challenges of extracting relevant information from code-mixed conversations. We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
arXiv Detail & Related papers (2024-11-07T14:41:01Z)
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
Multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. MSEs are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. We train the cross-lingual adapters with two different types of data to resolve the conflicting requirements of different cross-lingual tasks.
arXiv Detail & Related papers (2024-07-20T13:56:39Z)
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules. COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA). We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts. The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts. CLSE covers 74 different semantic types to support various applications from airline ticketing to video games. We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
Challenges and Considerations with Code-Mixed NLP for Multilingual Societies [1.6675267471157407]
This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good. We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good.
arXiv Detail & Related papers (2021-06-15T00:53:55Z)
GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations [28.693328393260906]
We introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset. GupShup contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English. We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation.
arXiv Detail & Related papers (2021-04-17T15:42:01Z)
NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet [0.0]
Social media has penetrated multilingual societies; however, most users choose English as their preferred language for communication. It is natural for them to mix their cultural language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, in today's world. Downstream NLP tasks using such data are challenging because the semantics are spread across multiple languages. This paper uses an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
arXiv Detail & Related papers (2020-10-15T14:09:02Z)
IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection [1.2301855531996841]
Code-mixing adds to the challenge of analyzing the sentiment of the text due to the non-standard writing style. We present a candidate sentence generation and selection based approach on top of the Bi-LSTM based neural classifier. The proposed approach shows an improvement in the system performance as compared to the Bi-LSTM based neural classifier.
arXiv Detail & Related papers (2020-06-25T14:59:47Z)
A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching. Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.