Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection
- URL: http://arxiv.org/abs/2412.12761v1
- Date: Tue, 17 Dec 2024 10:26:54 GMT
- Title: Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection
- Authors: Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro,
- Abstract summary: We experimented with three approaches to improve code-mixed humour and sarcasm detection.<n>In native sample mixing, we added monolingual task samples in code-mixed training sets.<n>In MTL learning, we relied on native and code-mixed samples of a semantically related task.<n>In VMLM learning, we evaluated the efficacy of VMLMs via few-shot context prompting.
- Score: 2.7582789611575897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we reported our experiments with various strategies to improve code-mixed humour and sarcasm detection. We did all of our experiments for Hindi-English code-mixed scenario, as we have the linguistic expertise for the same. We experimented with three approaches, namely (i) native sample mixing, (ii) multi-task learning (MTL), and (iii) prompting very large multilingual language models (VMLMs). In native sample mixing, we added monolingual task samples in code-mixed training sets. In MTL learning, we relied on native and code-mixed samples of a semantically related task (hate detection in our case). Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot context prompting. Some interesting findings we got are (i) adding native samples improved humor (raising the F1-score up to 6.76%) and sarcasm (raising the F1-score up to 8.64%) detection, (ii) training MLMs in an MTL framework boosted performance for both humour (raising the F1-score up to 10.67%) and sarcasm (increment up to 12.35% in F1-score) detection, and (iii) prompting VMLMs couldn't outperform the other approaches. Finally, our ablation studies and error analysis discovered the cases where our model is yet to improve. We provided our code for reproducibility.
Related papers
- SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest [5.882817862856554]
This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models.<n>Subsets of Interest (SOI) identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples.<n>Our results demonstrate that multi-source learning consistently improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results with notable gains in similar task combinations.
arXiv Detail & Related papers (2025-07-21T04:43:21Z) - MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection [0.0]
This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes.<n>The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages.<n>Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
arXiv Detail & Related papers (2025-05-27T08:26:17Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs)<n>We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.<n>Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Test-Time Code-Switching for Cross-lingual Aspect Sentiment Triplet Extraction [12.269762062755492]
We introduce a novel Test-Time Code-SWitching (TT-CSW) framework to bridge the gap between the bilingual training phase and the monolingual test-time prediction.
During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs.
In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation.
arXiv Detail & Related papers (2025-01-24T00:00:51Z) - Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario [2.7582789611575897]
This paper attempts to fill the gap through rigorous empirical experiments.
We consider the Hindi-English code-mixed setup as a case study.
Adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of literatures for code-mixed hate detection.
arXiv Detail & Related papers (2024-05-31T11:43:31Z) - TEII: Think, Explain, Interact and Iterate with Large Language Models to Solve Cross-lingual Emotion Detection [5.942385193284472]
Cross-lingual emotion detection allows us to analyze global trends, public opinion, and social phenomena at scale.
Our system outperformed the baseline by more than 0.16 F1-score absolute, and ranked second amongst competing systems.
arXiv Detail & Related papers (2024-05-27T12:47:40Z) - Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model in a way that the similarity between cross-lingual embeddings follows the similarity of sentences measured at the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
We construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - Joint Prediction and Denoising for Large-scale Multilingual
Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z) - mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models [57.225289079198454]
We propose mPLMSim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora.
Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexico, genealogical language family, and geographical sprachbund.
We further investigate whether mPLMSim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks.
arXiv Detail & Related papers (2023-05-23T04:44:26Z) - Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation.
We examine factors for prompt template and demonstration example selection.
We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning.
arXiv Detail & Related papers (2023-01-17T18:32:06Z) - FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark [8.158067688043554]
This work first introduces Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive small sample evaluation benchmark in Chinese.
An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples.
Next, we implement a set of state-of-the-art few-shot learning methods, and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark.
arXiv Detail & Related papers (2021-07-15T17:51:25Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.