CONFLATOR: Incorporating Switching Point based Rotatory Positional
Encodings for Code-Mixed Language Modeling
- URL: http://arxiv.org/abs/2309.05270v2
- Date: Wed, 18 Oct 2023 23:48:40 GMT
- Title: CONFLATOR: Incorporating Switching Point based Rotatory Positional
Encodings for Code-Mixed Language Modeling
- Authors: Mohsin Ali, Kandukuri Sai Teja, Neeharika Gupta, Parth Patwa, Anubhab
Chatterjee, Vinija Jain, Aman Chadha, Amitava Das
- Abstract summary: We introduce CONFLATOR: a neural language modeling approach for code-mixed languages.
We show that rotatory positional encodings along with switching point information yield the best results.
CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English.
- Score: 10.26356931263957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The mixing of two or more languages is called Code-Mixing (CM). CM is a
social norm in multilingual societies. Neural Language Models (NLMs) like
transformers have been effective on many NLP tasks. However, NLM for CM is an
under-explored area. Though transformers are capable and powerful, they cannot
always encode positional information since they are non-recurrent. Therefore,
to enrich word information and incorporate positional information, positional
encoding is defined. We hypothesize that Switching Points (SPs), i.e.,
junctions in the text where the language switches (L1 -> L2 or L2 -> L1), pose
a challenge for CM Language Models (LMs), and hence give special emphasis to
SPs in the modeling process. We experiment with several positional encoding
mechanisms and show that rotatory positional encodings along with switching
point information yield the best results.
We introduce CONFLATOR: a neural language modeling approach for code-mixed
languages. CONFLATOR tries to learn to emphasize switching points using smarter
positional encoding, both at unigram and bigram levels. CONFLATOR outperforms
the state-of-the-art on two tasks based on code-mixed Hindi and English
(Hinglish): (i) sentiment analysis and (ii) machine translation.
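As a rough illustration of the idea in the abstract, the sketch below applies rotary-style positional rotations whose position indices are made switching-point aware by restarting at every language switch. This is a minimal sketch under assumed details (word-level language tags, the restart rule, toy dimensions), not CONFLATOR's exact unigram/bigram formulation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for rotary positional encoding (RoPE)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    return np.outer(positions, inv_freq)                      # (seq_len, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (seq_len, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def sp_positions(lang_tags):
    """Position indices that restart at every switching point (L1 -> L2 or L2 -> L1).
    The restart rule is an illustrative assumption, not the paper's exact scheme."""
    positions, p = [], 0
    for i, tag in enumerate(lang_tags):
        if i > 0 and tag != lang_tags[i - 1]:   # switching point
            p = 0
        positions.append(p)
        p += 1
    return np.array(positions)

# Toy Hinglish sequence with hypothetical word-level language tags.
lang_tags = ["EN", "HI", "HI", "EN", "EN"]
dim = 8
q = np.random.randn(len(lang_tags), dim)        # stand-in query vectors

q_vanilla = apply_rope(q, rope_angles(np.arange(len(lang_tags)), dim))
q_sp      = apply_rope(q, rope_angles(sp_positions(lang_tags), dim))
```

In the actual model, such SP-aware rotations would be applied to queries and keys inside self-attention, at both the unigram and bigram levels described in the abstract.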
Related papers
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation [55.336629780101475]
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs).
LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation [69.68831888599476]
We develop a new positional encoding method called Bilevel Positional Encoding (BiPE), which blends an intra-segment encoding with an inter-segment encoding for each position.
Theoretical analysis shows this disentanglement of positional information makes learning more effective.
Our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
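The bilevel idea above can be sketched by decomposing each absolute position into a segment index and an offset within the segment. The sketch below is only an illustration under assumed details (segments delimited by a separator token, a sinusoidal intra-segment encoding); BiPE's published design may pair the two levels with different absolute/relative encodings.

```python
import numpy as np

def bilevel_positions(token_ids, separator_id):
    """Decompose absolute positions into (segment index, intra-segment offset).
    Segment boundaries from a separator token are an illustrative assumption."""
    seg, off, segments, offsets = 0, 0, [], []
    for tid in token_ids:
        segments.append(seg)
        offsets.append(off)
        off += 1
        if tid == separator_id:
            seg, off = seg + 1, 0
    return np.array(segments), np.array(offsets)

def sinusoidal(positions, dim):
    """Absolute sinusoidal encoding, used here for the intra-segment level."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = np.outer(positions, inv_freq)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

token_ids = [5, 9, 2, 7, 3, 2, 8]      # 2 = hypothetical separator id
segments, offsets = bilevel_positions(token_ids, separator_id=2)
intra = sinusoidal(offsets, dim=8)     # absolute encoding within each segment
# `segments` would feed a relative scheme over segment indices at the attention
# level, which is the inter-segment half of the bilevel design.
```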
arXiv Detail & Related papers (2024-01-29T18:59:07Z)
- Converting Epics/Stories into Pseudocode using Transformers [0.0]
Pseudocode is a programming language representation of the steps involved in a computer program.
We present a methodology to convert a problem described in the English language into pseudocode.
We find that the CodeT5 model gives the best results in terms of BLEU score when trained separately on the two subtasks mentioned above.
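For reference, below is a minimal HuggingFace-style sketch of using CodeT5 as the sequence-to-sequence backbone for story-to-pseudocode generation. The base checkpoint would first need fine-tuning on (story, pseudocode) pairs as in the paper; the example story and generation settings are assumptions.

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# CodeT5 uses a RoBERTa-style tokenizer; this loads the public base checkpoint.
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

story = "As a user, I want to search products by name so that I can find items quickly."
inputs = tokenizer(story, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```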
arXiv Detail & Related papers (2023-12-08T14:01:09Z)
- The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
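As a concrete, if simplified, illustration of the two properties, the sketch below computes dot-product similarities between standard sinusoidal position encodings: similarity concentrates around nearby positions (locality) and depends only on the size of the offset, not its direction (symmetry). The paper's probing tasks are more involved; this is only a toy demonstration.

```python
import numpy as np

def sinusoidal_pe(n_positions, dim):
    """Standard sinusoidal positional encodings from the original Transformer."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = np.outer(np.arange(n_positions), inv_freq)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

pe = sinusoidal_pe(64, 128)
sim = pe @ pe.T                      # dot-product similarity between positions
center = 32

# Locality: similarity tends to fall off as the distance between positions grows.
print(sim[center, center - 1], sim[center, center - 8], sim[center, center - 31])

# Symmetry: a forward and a backward offset of the same size score alike,
# since PE(i) . PE(j) depends only on |i - j| for sinusoidal encodings.
print(np.allclose(sim[center, center + 5], sim[center, center - 5]))
```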
arXiv Detail & Related papers (2023-10-19T16:15:15Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages [1.7073542935233876]
We present our initial observations on applying switching point based positional encoding techniques for CM languages.
Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
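The blurb does not spell out PESTO's formulation, so the sketch below only illustrates the general idea of making a relative positional bias switching-point aware: relative distances that cross a switching point are scaled differently before being added to attention logits. The language tags, scaling rule, and constants are all assumptions for illustration.

```python
import numpy as np

def switching_points(lang_tags):
    """Indices where the language changes with respect to the previous token."""
    return [i for i in range(1, len(lang_tags)) if lang_tags[i] != lang_tags[i - 1]]

def sp_aware_relative_bias(lang_tags, same_lang_scale=1.0, cross_sp_scale=2.0):
    """Toy relative position bias: the distance between two tokens is scaled up
    when a switching point lies between them, so attention becomes more
    position-sensitive around switches. This rule is an assumption, not
    PESTO's published formulation."""
    n = len(lang_tags)
    sps = set(switching_points(lang_tags))
    bias = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            lo, hi = sorted((i, j))
            crosses_sp = any(lo < s <= hi for s in sps)
            scale = cross_sp_scale if crosses_sp else same_lang_scale
            bias[i, j] = -scale * abs(i - j)   # added to attention logits
    return bias

tags = ["HI", "HI", "EN", "EN", "HI"]          # hypothetical word-level tags
print(switching_points(tags))                   # [2, 4]
print(sp_aware_relative_bias(tags).round(1))
```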
arXiv Detail & Related papers (2021-11-12T08:18:21Z)
- The Impact of Positional Encodings on Multilingual Compression [3.454503173118508]
Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture.
We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models.
arXiv Detail & Related papers (2021-09-11T23:22:50Z)
- DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z)
- Word Level Language Identification in English Telugu Code Mixed Data [7.538482310185133]
Intrasentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays.
We present a study of various models for Language Identification: Naive Bayes, Random Forest, Conditional Random Field (CRF), and Hidden Markov Model (HMM).
Our best performing system is CRF-based with an f1-score of 0.91.
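Below is a tiny baseline in the spirit of the models compared above; the paper's best system was CRF-based, while this sketch uses the simpler Naive Bayes option with character n-gram features. The romanized Telugu/English words and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy word-level training data (hypothetical): "en" = English, "te" = Telugu.
words  = ["movie", "chala", "bagundi", "please", "cheppu", "time", "entha"]
labels = ["en",    "te",    "te",      "en",     "te",     "en",   "te"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams
    MultinomialNB(),
)
clf.fit(words, labels)
print(clf.predict(["super", "baga"]))   # per-word language predictions
```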
arXiv Detail & Related papers (2020-10-09T10:15:06Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)