CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling
- URL: http://arxiv.org/abs/2505.12587v1
- Date: Mon, 19 May 2025 00:50:49 GMT
- Title: CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling
- Authors: Aditeya Baral, Allen George Ajith, Roshan Nayak, Mrityunjay Abhijeet Bhanja
- Abstract summary: CMLFormer is an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations. Experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations with multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.
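The abstract specifies the high-level architecture (shared encoder, dual decoders, switching-point objectives) but not its implementation details. A minimal PyTorch sketch of the general pattern, assuming placeholder layer sizes and a hypothetical token-level switching-point head, might look like this:

```python
import torch
import torch.nn as nn


class DualDecoderCodeMixModel(nn.Module):
    """Illustrative sketch: a shared encoder whose output ("memory") is
    cross-attended by two decoders, plus a token-level switching-point head.
    Sizes and heads are placeholders, not the paper's configuration."""

    def __init__(self, vocab_size=32000, d_model=512, nhead=8,
                 num_enc_layers=6, num_dec_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_enc_layers)
        # Two decoders (e.g. one per language stream); both attend to the
        # same shared encoder memory, which keeps their context synchronized.
        self.decoder_a = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_dec_layers)
        self.decoder_b = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_dec_layers)
        self.lm_head_a = nn.Linear(d_model, vocab_size)
        self.lm_head_b = nn.Linear(d_model, vocab_size)
        # Binary head: is each source token a switching point?
        self.switch_head = nn.Linear(d_model, 2)

    def forward(self, src_ids, tgt_a_ids, tgt_b_ids):
        memory = self.shared_encoder(self.embed(src_ids))
        hid_a = self.decoder_a(self.embed(tgt_a_ids), memory)
        hid_b = self.decoder_b(self.embed(tgt_b_ids), memory)
        return (self.lm_head_a(hid_a), self.lm_head_b(hid_b),
                self.switch_head(memory))


# Smoke test with random token ids.
model = DualDecoderCodeMixModel()
src = torch.randint(0, 32000, (2, 16))
logits_a, logits_b, switch_logits = model(src, src, src)
print(logits_a.shape, switch_logits.shape)  # (2, 16, 32000) (2, 16, 2)
```

The switching-point logits over the encoder memory stand in for one of the paper's multiple pre-training objectives; the actual loss weighting and decoder targets are not described in the abstract.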
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
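The summary does not spell out how the synthetic code-switching is generated; a common baseline (assumed here for illustration, not necessarily this paper's method) is lexicon-based word replacement, with the switch positions recorded for later supervision:

```python
import random

# Toy bilingual lexicon; real pipelines would use word alignments or MT output.
HI_EN_LEXICON = {
    "bahut": "very",
    "accha": "good",
    "kal": "tomorrow",
    "kaam": "work",
}


def synthesize_code_switched(tokens, lexicon, switch_prob=0.3, seed=None):
    """Randomly replace tokens with their cross-lingual counterparts to
    create synthetic code-switched text for pre-training augmentation."""
    rng = random.Random(seed)
    mixed, switch_points = [], []
    for i, tok in enumerate(tokens):
        if tok in lexicon and rng.random() < switch_prob:
            mixed.append(lexicon[tok])
            switch_points.append(i)  # record where the language changes
        else:
            mixed.append(tok)
    return mixed, switch_points


sentence = "kal bahut kaam hai aur mausam accha hai".split()
print(synthesize_code_switched(sentence, HI_EN_LEXICON, seed=0))
```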
arXiv Detail & Related papers (2025-04-02T15:09:58Z) - CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition [1.534667887016089]
Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. We propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks. We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters.
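As an illustration of the PEFT mechanism mentioned above (not the authors' code), a LoRA module can be sketched as a frozen pre-trained linear layer plus a trainable low-rank update; the layer names and sizes below are placeholders:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA illustration: keep the pre-trained linear weight frozen
    and learn a low-rank update (B @ A) scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)


# Wrap a placeholder projection layer; in practice such modules would be
# applied to attention/MLP projections inside the pre-trained visual encoder.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # only the low-rank factors
```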
arXiv Detail & Related papers (2025-04-02T12:15:33Z) - Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding [27.499426765845705]
Code-switching automatic speech recognition (ASR) faces challenges due to the language confusion resulting from accents, auditory similarity, and seamless language switches. We adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to code-switching (CS) on both the encoder and decoder sides.
arXiv Detail & Related papers (2024-12-21T07:06:44Z) - Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation [8.225408779913712]
Referring image segmentation aims to segment an object referred to by a natural language expression from an image.
Conventional transformer decoders can distort linguistic information with deeper layers, leading to suboptimal results.
We introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder.
arXiv Detail & Related papers (2024-04-12T07:13:32Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect code clones in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling [10.26356931263957]
We introduce CONFLATOR: a neural language modeling approach for code-mixed languages.
We show that rotatory positional encodings along with switching point information yield the best results.
CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English.
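The summary does not give CONFLATOR's exact formulation; one hedged reading, sketched below, applies standard rotary position embeddings to positions that reset at each annotated switching point (the reset scheme is an assumption for illustration):

```python
import torch


def rotary_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for the given (possibly segment-relative) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]  # (seq, dim/2)


def apply_rope(x, angles):
    """Rotate channel pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def switching_aware_positions(switch_flags):
    """One possible scheme: restart the position counter at each switching
    point so rotations encode distance within the current language span."""
    positions, pos = [], 0
    for flag in switch_flags:
        if flag:  # 1 marks a token where the language switches
            pos = 0
        positions.append(pos)
        pos += 1
    return torch.tensor(positions)


# Example: a 6-token sentence that switches language at tokens 2 and 4.
flags = [0, 0, 1, 0, 1, 0]
q = torch.randn(6, 64)
q_rot = apply_rope(q, rotary_angles(switching_aware_positions(flags), 64))
print(q_rot.shape)  # torch.Size([6, 64])
```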
arXiv Detail & Related papers (2023-09-11T07:02:13Z) - Examining Scaling and Transfer of Language Model Architectures for Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids degenerating into predicting masked words conditioned only on context from the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
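As a hedged sketch of the general pattern described above (not VECO's actual implementation), an encoder block can be extended with a cross-attention sublayer over the paired language's representations:

```python
import torch
import torch.nn as nn


class EncoderLayerWithCrossAttention(nn.Module):
    """Illustrative encoder block: standard self-attention plus an extra
    cross-attention sublayer over representations of the paired language."""

    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.GELU(),
                                nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, other_lang):
        # Self-attention over the current language.
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Cross-attention into the other language builds explicit
        # interdependence, so masked words are not predicted from
        # monolingual context alone.
        x = self.norm2(x + self.cross_attn(x, other_lang, other_lang,
                                           need_weights=False)[0])
        return self.norm3(x + self.ff(x))


layer = EncoderLayerWithCrossAttention()
en, xx = torch.randn(2, 10, 512), torch.randn(2, 12, 512)
print(layer(en, xx).shape)  # torch.Size([2, 10, 512])
```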
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.