A Comprehensive Understanding of Code-mixed Language Semantics using
Hierarchical Transformer
- URL: http://arxiv.org/abs/2204.12753v1
- Date: Wed, 27 Apr 2022 07:50:18 GMT
- Title: A Comprehensive Understanding of Code-mixed Language Semantics using
Hierarchical Transformer
- Authors: Ayan Sengupta, Tharun Suresh, Md Shad Akhtar, and Tanmoy Chakraborty
- Abstract summary: We propose a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages.
We evaluate the proposed method across 6 Indian languages and 9 NLP tasks on 17 datasets.
- Score: 28.3684494647968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Being a popular mode of text-based communication in multilingual communities,
code-mixing in online social media has become an important subject to study.
Learning the semantics and morphology of code-mixed language remains a key
challenge, due to the scarcity of data and the unavailability of robust,
language-invariant representation learning techniques. Any morphologically rich
language can benefit from character-, subword-, and word-level embeddings, which aid
in learning meaningful correlations. In this paper, we explore a hierarchical
transformer-based architecture (HIT) to learn the semantics of code-mixed
languages. HIT consists of multi-headed self-attention and outer product
attention components to simultaneously comprehend the semantic and syntactic
structures of code-mixed texts. We evaluate the proposed method across 6 Indian
languages (Bengali, Gujarati, Hindi, Tamil, Telugu and Malayalam) and Spanish
for 9 NLP tasks on 17 datasets. The HIT model outperforms state-of-the-art
code-mixed representation learning and multilingual language models in all
tasks. We further demonstrate the generalizability of the HIT architecture
using masked language modeling-based pre-training, zero-shot learning, and
transfer learning approaches. Our empirical results show that the pre-training
objectives significantly improve the performance on downstream tasks.
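For intuition, the sketch below shows how the two attention components named above could sit inside one hierarchical layer. It is a minimal PyTorch illustration, not the authors' implementation: the bilinear outer-product scoring, the dimensions, and the learned fusion weight `alpha` are all assumptions.

```python
# Minimal sketch of one HIT-style layer: multi-headed self-attention fused
# with an outer-product-style attention. The bilinear scoring, dimensions,
# and learned fusion weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OuterProductAttention(nn.Module):
    """Scores token pairs through a bilinear (outer-product) interaction
    rather than a plain scaled dot product (hypothetical formulation)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.w = nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)

    def forward(self, x):                       # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # every (q_i, k_j) pair interacts dimension-wise through w
        scores = torch.einsum("bid,de,bje->bij", q, self.w, k)
        return F.softmax(scores / x.size(-1) ** 0.5, dim=-1) @ v

class HITBlock(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.opa = OuterProductAttention(d_model)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # fusion weight (assumption)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        sa, _ = self.msa(x, x, x)
        return self.norm(x + self.alpha * sa + (1 - self.alpha) * self.opa(x))

x = torch.randn(2, 10, 128)          # (batch, tokens, features)
print(HITBlock()(x).shape)           # torch.Size([2, 10, 128])
```

In the full architecture, blocks like this would be stacked hierarchically, with character- and subword-level encoders feeding a word-level encoder, realizing the character/subword/word hierarchy the abstract motivates.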
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt-based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to examine whether large language models are able to classify words into the correct categories.
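As an illustration of this kind of prompting, here is a minimal sketch using the OpenAI Python client; the prompt wording, label set, and example sentence are assumptions, not the shared task's actual setup.

```python
# Minimal sketch of word-level language identification via prompting.
# The prompt wording and label set are assumptions for illustration;
# the paper's actual shared-task prompt may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_words(sentence: str) -> str:
    prompt = (
        "For each word in the following code-mixed sentence, output the word "
        "and its language tag (e.g. TAMIL, ENGLISH, MIXED, OTHER), "
        "one pair per line.\n\n"
        f"Sentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels
    )
    return resp.choices[0].message.content

print(label_words("naan movie paathen it was great"))
```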
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
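One simple way to surface that implicit language information, sketched below under the assumption that word-level language tags are available: mark each word with its language before tokenization, so a BERT-style model can condition on the tags. The tag vocabulary and model choice are illustrative, not the paper's exact method.

```python
# Hypothetical sketch: expose word-level language IDs to a BERT-style model
# by marking each word with a language tag before tokenization. The tag
# vocabulary ("<HI>", "<EN>") is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokenizer.add_special_tokens({"additional_special_tokens": ["<HI>", "<EN>"]})

def tag_and_encode(words, lang_tags):
    tagged = " ".join(f"<{t}> {w}" for w, t in zip(words, lang_tags))
    return tokenizer(tagged, return_tensors="pt")

enc = tag_and_encode(["kal", "movie", "dekhenge"], ["HI", "EN", "HI"])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
```

A real model would also need its embedding matrix resized to cover the new tag tokens, e.g. `model.resize_token_embeddings(len(tokenizer))`.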
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
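The basic operation behind such schemes can be pictured as dictionary-driven word swapping; the toy sketch below assumes a small English-Spanish dictionary, while the paper's actual negative-construction rules are more involved.

```python
# Toy sketch of a code-switching scheme: replace a random subset of words
# with dictionary translations, producing variants that stay superficially
# close to the anchor utterance. The dictionary is an assumption.
import random

en_to_es = {"play": "reproducir", "music": "música",
            "next": "siguiente", "song": "canción"}

def code_switch(sentence: str, p: float = 0.5) -> str:
    words = sentence.split()
    switched = [en_to_es.get(w, w) if random.random() < p else w
                for w in words]
    return " ".join(switched)

random.seed(0)
print(code_switch("play the next song"))  # e.g. "reproducir the siguiente song"
```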
arXiv Detail & Related papers (2022-05-07T13:44:28Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
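A compressed sketch of that setup, with module names, sizes, and token ids all assumed: goal and observation tokens form one sequence, a pre-trained LM encodes it, and a small head reads out actions.

```python
# Hypothetical sketch: goals and observations as one embedding sequence,
# with the policy initialized from a pre-trained language model.
import torch
import torch.nn as nn
from transformers import GPT2Model

class LMPolicy(nn.Module):
    def __init__(self, n_actions: int = 8):
        super().__init__()
        self.lm = GPT2Model.from_pretrained("gpt2")   # pre-trained initialization
        self.action_head = nn.Linear(self.lm.config.n_embd, n_actions)

    def forward(self, token_ids):                     # goal tokens + observation tokens
        hidden = self.lm(input_ids=token_ids).last_hidden_state
        return self.action_head(hidden[:, -1])        # act from the last position

policy = LMPolicy()
logits = policy(torch.tensor([[318, 257, 1332]]))     # arbitrary token ids
print(logits.shape)                                   # torch.Size([1, 8])
```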
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Cross-Lingual Adaptation for Type Inference [29.234418962960905]
We propose a cross-lingual adaptation framework, PLATO, to transfer a deep learning-based type inference procedure across weakly typed languages.
By leveraging data from strongly typed languages, PLATO improves the perplexity of the backbone cross-programming-language model.
arXiv Detail & Related papers (2021-07-01T00:20:24Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
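At the core of such distillation-based approaches is a loss in which a student matches a teacher's soft tag distributions on unlabeled target-language text; the sketch below assumes the shapes and temperature for illustration.

```python
# Minimal sketch of the distillation objective on unlabeled target-language
# text: the student matches the teacher's soft tag distributions.
# Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    # logits: (batch, seq_len, num_tags)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * T * T

s = torch.randn(4, 12, 9)   # student logits over 9 NER tags
t = torch.randn(4, 12, 9)   # teacher logits on the same unlabeled sentences
print(distillation_loss(s, t))
```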
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation [18.136640008855117]
We propose HIT, a robust representation learning method for code-mixed texts.
HIT is a hierarchical transformer-based framework that captures the semantic relationship among words.
Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages suggests significant performance improvement against various state-of-the-art systems.
arXiv Detail & Related papers (2021-05-30T18:53:33Z)
- Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling [12.497781134446898]
We address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods.
First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data and transferred to the code-switching domain.
Second, we propose a novel multilingual meta-embeddings approach, sketched after this list, to effectively represent code-switching data by acquiring useful knowledge learned in other languages.
Third, we introduce multi-task learning to integrate syntactic information as a transfer learning strategy to a language model and learn where to code-switch.
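A minimal sketch of a multilingual meta-embeddings layer, with dimensions and the scoring function assumed: each language-specific embedding of a token is projected to a common space, and the projections are combined with learned attention weights.

```python
# Minimal sketch of multilingual meta-embeddings: combine several
# language-specific embeddings of the same token with learned attention.
# Dimensions and the scoring function are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaEmbedding(nn.Module):
    def __init__(self, dims, d_common: int = 100):
        super().__init__()
        # one projection per source embedding space (e.g. English, Spanish)
        self.proj = nn.ModuleList([nn.Linear(d, d_common) for d in dims])
        self.score = nn.Linear(d_common, 1)

    def forward(self, embeddings):
        # embeddings: list of (batch, d_i) vectors for the same tokens
        projected = torch.stack([p(e) for p, e in zip(self.proj, embeddings)], dim=1)
        weights = F.softmax(self.score(projected), dim=1)   # (batch, n_langs, 1)
        return (weights * projected).sum(dim=1)             # (batch, d_common)

meta = MetaEmbedding(dims=[300, 200])
out = meta([torch.randn(4, 300), torch.randn(4, 200)])
print(out.shape)   # torch.Size([4, 100])
```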
arXiv Detail & Related papers (2021-04-13T14:49:26Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
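A bare-bones sketch of such a plug-in cross-attention module, with sizes and residual wiring assumed: token states from one language attend over the parallel sentence in the other.

```python
# Bare-bones sketch of a cross-attention plug-in: tokens from language A
# attend over the parallel sentence in language B. The residual wiring
# and dimensions are assumptions, not VECO's exact design.
import torch
import torch.nn as nn

class CrossLingualAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_a, h_b):
        # h_a: (batch, len_a, d) states of language A; h_b: language B
        cross, _ = self.attn(query=h_a, key=h_b, value=h_b)
        # masked words in A can now be predicted from context in B
        return self.norm(h_a + cross)

layer = CrossLingualAttention()
out = layer(torch.randn(2, 7, 256), torch.randn(2, 9, 256))
print(out.shape)   # torch.Size([2, 7, 256])
```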
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Because translated text in the target language lacks gold labels, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for that translated text.
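Stripped down, a self-teaching loss of this kind is a KL term between the model's predictions on translated text and its own auto-generated soft pseudo-labels; the sketch below assumes the shapes and a plain classification setting.

```python
# Sketch of a KL-divergence self-teaching loss: auto-generated soft
# pseudo-labels supervise the model's predictions on translated text.
# Shapes and the plain classification setting are assumptions.
import torch
import torch.nn.functional as F

def self_teaching_loss(logits_translated, pseudo_label_logits):
    with torch.no_grad():                           # pseudo-labels give no gradient
        soft_labels = F.softmax(pseudo_label_logits, dim=-1)
    log_probs = F.log_softmax(logits_translated, dim=-1)
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")

print(self_teaching_loss(torch.randn(8, 3), torch.randn(8, 3)))
```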
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
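The usual protocol for such a benchmark, sketched here with dummy vectors and gold scores: score each concept pair with the model (e.g. cosine similarity of word vectors) and report the Spearman correlation against the human ratings.

```python
# Sketch of the standard evaluation on a SimLex-style benchmark: Spearman
# correlation between model cosine similarities and human ratings.
# The vectors and gold scores below are dummy stand-ins.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50)
           for w in ["cat", "dog", "car", "boat", "river", "cup"]}
pairs = [("cat", "dog", 7.6), ("car", "boat", 4.1), ("river", "cup", 0.6)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
gold_scores = [g for *_, g in pairs]
rho, _ = spearmanr(model_scores, gold_scores)
print(rho)
```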
arXiv Detail & Related papers (2020-03-10T17:17:01Z)