Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text
- URL: http://arxiv.org/abs/2403.04872v2
- Date: Tue, 7 May 2024 16:38:25 GMT
- Title: Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text
- Authors: Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee,
- Abstract summary: We investigate how pre-trained Language Models handle code-switched text in three dimensions.
Our findings reveal that pre-trained language models are effective in generalising to code-switched text.
- Score: 1.9185059111021852
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data including the novel corpus at https://github.com/francesita/code-mixed-probes.
Related papers
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Share What You Already Know: Cross-Language-Script Transfer and
Alignment for Sentiment Detection in Code-Mixed Data [0.0]
Code-switching entails mixing multiple languages. It is an increasingly occurring phenomenon in social media texts.
Pre-trained multilingual models primarily utilize the data in the native script of the language.
Using the native script for each language can generate better representations of the text owing to the pre-trained knowledge.
arXiv Detail & Related papers (2024-02-07T02:59:18Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - Improving Code-switching Language Modeling with Artificially Generated
Texts using Cycle-consistent Adversarial Networks [41.88097793717185]
We investigate methods to augment Code-switching training text data by artificially generating them.
We propose a cycle-consistent adversarial networks based framework to transfer monolingual text into Code-switching text.
arXiv Detail & Related papers (2021-12-12T21:27:32Z) - From Machine Translation to Code-Switching: Generating High-Quality
Code-Switched Text [14.251949110756078]
We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
arXiv Detail & Related papers (2021-07-14T04:46:39Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model
Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.