CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing
- URL: http://arxiv.org/abs/2106.06004v1
- Date: Thu, 10 Jun 2021 18:49:29 GMT
- Title: CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing
- Authors: Sai Muralidhar Jayanthi, Kavya Nerella, Khyathi Raghavi Chandu, Alan W
Black
- Abstract summary: We present CodemixedNLP, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community.
The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish.
- Score: 44.54537067761167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The NLP community has witnessed steep progress in a variety of tasks across
the realms of monolingual and multilingual language processing recently. These
successes, in conjunction with the proliferating mixed-language interactions on
social media, have boosted interest in modeling code-mixed texts. In this work,
we present CodemixedNLP, an open-source library with the goals of bringing
together the advances in code-mixed NLP and opening it up to a wider machine
learning community. The library consists of tools to develop and benchmark
versatile model architectures that are tailored for mixed texts, methods to
expand training sets, techniques to quantify mixing styles, and fine-tuned
state-of-the-art models for 7 tasks in Hinglish. We believe this work has the
potential to foster a distributed yet collaborative and sustainable ecosystem
in an otherwise dispersed space of code-mixing research. The toolkit is
designed to be simple, easily extensible, and a useful resource for both
researchers and practitioners.
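
The abstract mentions techniques to quantify mixing styles. One widely used measure of this kind is the Code-Mixing Index (CMI) of Gambäck and Das (2014); the snippet below is a minimal, generic sketch of that metric and is not CodemixedNLP's actual API. It assumes every token already carries a language tag (e.g. "hi", "en", or a language-independent tag such as "univ" or "ne").

```python
from collections import Counter

def code_mixing_index(lang_tags, other_tags=("univ", "ne")):
    """CMI = 100 * (1 - max_i w_i / (n - u)), where w_i counts tokens in
    language i, n is the total number of tokens, and u counts
    language-independent tokens (named entities, punctuation, etc.)."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t in other_tags)
    if n == u:  # nothing is language-tagged, so there is no mixing to measure
        return 0.0
    counts = Counter(t for t in lang_tags if t not in other_tags)
    dominant = max(counts.values())  # token count of the dominant language
    return 100.0 * (1.0 - dominant / (n - u))

# Example: a short Hinglish utterance tagged token by token.
print(code_mixing_index(["hi", "hi", "en", "hi", "en", "univ"]))  # 40.0
```

A monolingual sentence scores 0, and the score grows as tokens are spread more evenly across languages, which makes such a metric convenient for comparing corpora or filtering synthetic code-mixed data by mixing intensity.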
Related papers
- Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions [7.103987978402038]
We introduce a novel technique termed Mixture-of-Instructions (MoI).
MoI employs a strategy of instruction concatenation combined with diverse system prompts to boost the alignment efficiency of language models.
Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI.
arXiv Detail & Related papers (2024-04-29T03:58:12Z) - CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z) - Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation [34.57825234659946]
We tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation.
We propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text.
Our evaluation and comprehensive analyses demonstrate the superiority of RCMT over state-of-the-art code-mixed and robust translation methods.
arXiv Detail & Related papers (2024-03-25T13:50:11Z) - SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z) - Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language
Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
In our submission to HinglishEval, a shared task collocated with INLG 2022, we build models that estimate the quality of synthetically generated code-mixed text by predicting code-mix quality ratings.
arXiv Detail & Related papers (2022-06-16T08:00:42Z) - A Comprehensive Understanding of Code-mixed Language Semantics using
Hierarchical Transformer [28.3684494647968]
We propose a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages.
We evaluate the proposed method across 6 Indian languages and 9 NLP tasks on 17 datasets.
arXiv Detail & Related papers (2022-04-27T07:50:18Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
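
The last entry above learns word embeddings with an LSTM encoder-decoder that simultaneously translates and reconstructs the input sentence. As a rough illustration of that kind of dual objective (a hypothetical sketch, not the authors' implementation; all class and parameter names are made up), one shared encoder can feed two decoders whose losses are summed:

```python
import torch
import torch.nn as nn

class DualObjectiveSeq2Seq(nn.Module):
    """Hypothetical sketch: one LSTM encoder shared by a reconstruction decoder
    and a translation decoder, trained with the sum of both losses."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.rec_decoder = nn.LSTM(dim, dim, batch_first=True)    # reconstructs the source
        self.trans_decoder = nn.LSTM(dim, dim, batch_first=True)  # translates to the target
        self.src_out = nn.Linear(dim, src_vocab)
        self.tgt_out = nn.Linear(dim, tgt_vocab)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, src, src_in, tgt_in, tgt):
        # src_in / tgt_in are teacher-forcing inputs (gold sequences shifted right).
        _, state = self.encoder(self.src_emb(src))            # encode the source once
        rec_h, _ = self.rec_decoder(self.src_emb(src_in), state)
        tr_h, _ = self.trans_decoder(self.tgt_emb(tgt_in), state)
        rec_loss = self.loss(self.src_out(rec_h).transpose(1, 2), src)
        tr_loss = self.loss(self.tgt_out(tr_h).transpose(1, 2), tgt)
        return rec_loss + tr_loss                             # joint objective

model = DualObjectiveSeq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (4, 12))
tgt = torch.randint(0, 1200, (4, 10))
model(src, src, tgt, tgt).backward()  # teacher-forcing shift omitted for brevity
```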