PESTO: Switching Point based Dynamic and Relative Positional Encoding
for Code-Mixed Languages
- URL: http://arxiv.org/abs/2111.06599v1
- Date: Fri, 12 Nov 2021 08:18:21 GMT
- Title: PESTO: Switching Point based Dynamic and Relative Positional Encoding
for Code-Mixed Languages
- Authors: Mohsin Ali, Kandukuri Sai Teja, Sumanth Manduru, Parth Patwa, Amitava
Das
- Abstract summary: We present our initial observations on applying switching point based positional encoding techniques for CM languages.
Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
- Score: 1.7073542935233876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP applications for code-mixed (CM) or mixed-lingual text have
gained significant momentum recently, the main reason being the prevalence of
language mixing in social media communications in multilingual societies like
India, Mexico, Europe, parts of the USA, etc. Word embeddings are basic
building blocks of any NLP system today, yet word embeddings for CM languages
are an unexplored territory. The major bottleneck for CM word embeddings is
switching points, where the language switches. These locations lack context,
and statistical systems fail to model this phenomenon due to the high
variance in the seen examples. In this paper, we present our initial
observations on applying switching point based positional encoding techniques
for CM language, specifically Hinglish (Hindi-English). Results are only
marginally better than SOTA, but it is evident that positional encoding could
be an effective way to train position-sensitive language models for CM text.
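To make the core idea concrete, below is a minimal sketch (our own illustration, not the authors' code) of switching point based relative positions: detect the indices where the language tag changes, then count each token's position relative to the most recent switch, so positional information restarts at every switching point.

```python
# Hypothetical sketch of switching point based relative positions for
# code-mixed text; function names are illustrative, not the authors' code.

def switching_points(lang_tags):
    """Indices where the language of the current token differs from the
    previous token, e.g. ['hi', 'hi', 'en', 'hi'] -> [2, 3]."""
    return [i for i in range(1, len(lang_tags))
            if lang_tags[i] != lang_tags[i - 1]]

def relative_positions(lang_tags):
    """Position of each token relative to the most recent switching point,
    so positional indices restart whenever the language switches."""
    points = set(switching_points(lang_tags))
    positions, rel = [], 0
    for i in range(len(lang_tags)):
        if i in points:
            rel = 0
        positions.append(rel)
        rel += 1
    return positions

# Hinglish example: "mujhe yeh movie really pasand hai"
tags = ["hi", "hi", "en", "en", "hi", "hi"]
print(switching_points(tags))    # [2, 4]
print(relative_positions(tags))  # [0, 1, 0, 1, 0, 1]
```

These relative indices could then be fed to any standard positional encoding scheme in place of absolute positions.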
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt-based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to investigate whether large language models are able to correctly classify words into the correct categories.
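As a rough illustration of how such a prompt-based word-level LI setup could look (the exact prompt wording and label set are our assumptions; the abstract does not specify them):

```python
# Illustrative prompt construction for word-level language identification;
# the actual prompt used in the paper is not shown in the abstract.

def build_li_prompt(sentence, languages=("Tamil", "English", "Mixed")):
    """Build a word-level LI prompt for a code-mixed sentence."""
    labels = ", ".join(languages)
    return (
        "For each word in the following code-mixed sentence, output the word "
        f"and its language, choosing from: {labels}.\n"
        f"Sentence: {sentence}\n"
        "Answer with one 'word<TAB>language' pair per line."
    )

prompt = build_li_prompt("naan office poiten because meeting irundhuchu")
print(prompt)
# The prompt string would then be sent to GPT-3.5 Turbo via the chat
# completions API and the model's response parsed line by line.
```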
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture, CoSTA, that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation [55.336629780101475]
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs).
LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and narrows the perplexity gap across diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling [10.26356931263957]
We introduce CONFLATOR: a neural language modeling approach for code-mixed languages.
We show that rotatory positional encodings along with switching point information yield the best results.
CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English.
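A hedged sketch of what combining rotary positional encodings with switching point information might look like, based only on the abstract: the rotation formula below is standard RoPE, and feeding it switching-point-relative positions is our reading, not the authors' released code.

```python
# Rotary-style positional encoding applied at positions measured relative
# to switching points (our interpretation of the abstract).
import numpy as np

def rotary_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: theta_j = pos / base^(2j/dim)."""
    j = np.arange(dim // 2)
    inv_freq = base ** (-2.0 * j / dim)
    return np.outer(np.asarray(positions), inv_freq)  # (seq_len, dim/2)

def apply_rotary(x, positions):
    """Rotate feature pairs of x (seq_len, dim) by position-dependent
    angles; switching-point-relative positions make the rotation restart
    at every language switch."""
    ang = rotary_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(6, 8)
# Positions restart at each switching point, as in the PESTO sketch above.
print(apply_rotary(x, [0, 1, 0, 1, 0, 1]).shape)  # (6, 8)
```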
arXiv Detail & Related papers (2023-09-11T07:02:13Z)
- Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns [0.5560631344057825]
We propose several Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks.
Our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy.
We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching up to a relative improvement of 7.73% on the extremely scarce English-Malayalam dataset.
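A minimal sketch of the masking idea described above, assuming token-level language tags are available; the masking probability and selection strategy here are illustrative, not the paper's exact method.

```python
# Hypothetical constant-mask augmentation for code-mixing; parameters
# and strategy are our assumptions based on the abstract.
import random

def scm_mask(tokens, lang_tags, matrix_lang="en", mask="[MASK]", p=0.3):
    """Replace matrix-language tokens with a constant mask token, each
    with probability p, simulating positions where an embedded language
    could appear."""
    return [mask if tag == matrix_lang and random.random() < p else tok
            for tok, tag in zip(tokens, lang_tags)]

random.seed(0)
print(scm_mask(["I", "love", "this", "padam", "so", "much"],
               ["en", "en", "en", "ml", "en", "en"]))
```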
arXiv Detail & Related papers (2022-11-14T18:50:16Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text [4.4904382374090765]
Code-mixed text comprises text written in more than one language.
People naturally tend to combine their local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
arXiv Detail & Related papers (2020-11-23T08:08:09Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
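A hedged PyTorch sketch of the dual-objective architecture as we read it: a shared LSTM encoder whose states feed both a translation decoder and a reconstruction decoder. Layer sizes and the single-layer setup are our illustration, not the paper's configuration.

```python
# Shared encoder with translation and reconstruction decoders (sketch).
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        # Two decoders share the encoder's contextualised states.
        self.translate = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, tgt_vocab)
        self.to_src = nn.Linear(dim, src_vocab)

    def forward(self, src_ids):
        h, state = self.encoder(self.embed(src_ids))
        # h holds the contextualised source-word representations that
        # would serve as cross-lingual word embeddings.
        trans_logits = self.to_tgt(self.translate(h, state)[0])
        recon_logits = self.to_src(self.reconstruct(h, state)[0])
        return h, trans_logits, recon_logits

model = TranslateAndReconstruct(src_vocab=1000, tgt_vocab=1200)
h, t, r = model(torch.randint(0, 1000, (2, 7)))
print(h.shape, t.shape, r.shape)  # (2,7,256) (2,7,1200) (2,7,1000)
```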
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Word Level Language Identification in English Telugu Code Mixed Data [7.538482310185133]
Intrasentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays.
We present a study of various models - Naïve Bayes, Random Forest, Conditional Random Field (CRF), and Hidden Markov Model (HMM) - for Language Identification.
Our best-performing system is CRF-based, with an F1-score of 0.91.
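For readers wanting a starting point, here is an illustrative CRF setup for word-level LI in the spirit of the best-performing system; the features, toy data, and hyperparameters are our assumptions, and it requires the sklearn-crfsuite package.

```python
# Illustrative CRF for word-level language identification.
import sklearn_crfsuite

def word_features(sent, i):
    """Simple lexical features for the i-th word; chosen for illustration."""
    w = sent[i]
    return {
        "lower": w.lower(),
        "suffix3": w[-3:],
        "is_ascii": w.isascii(),
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

def featurize(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy English-Telugu code-mixed training data (labels are illustrative).
sents = [["nenu", "office", "ki", "velthunna"],
         ["meeting", "is", "at", "padi", "gantalaku"]]
labels = [["te", "en", "te", "te"], ["en", "en", "en", "te", "te"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([featurize(s) for s in sents], labels)
print(crf.predict([featurize(["nenu", "meeting", "ki", "velthunna"])]))
```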
arXiv Detail & Related papers (2020-10-09T10:15:06Z)