WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data
augmentation in tExt Regression Tasks
- URL: http://arxiv.org/abs/2303.02758v1
- Date: Sun, 5 Mar 2023 19:45:42 GMT
- Title: WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data
augmentation in tExt Regression Tasks
- Authors: Manan Suri, Aaryak Garg, Divya Chaudhary, Ian Gorton, Bijendra Kumar
- Abstract summary: In this paper, we propose a novel weak-labeling strategy for data augmentation in text regression tasks called WADER.
We benchmark the performance of State-of-the-Art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in data.
- Score: 4.102007186133394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intimacy is an essential element of human relationships and language is a
crucial means of conveying it. Textual intimacy analysis can reveal social
norms in different contexts and serve as a benchmark for testing computational
models' ability to understand social information. In this paper, we propose a
novel weak-labeling strategy for data augmentation in text regression tasks
called WADER. WADER uses data augmentation to address the problems of data
imbalance and data scarcity and provides a method for data augmentation in
cross-lingual, zero-shot tasks. We benchmark the performance of
State-of-the-Art pre-trained multilingual language models using WADER and
analyze the use of sampling techniques to mitigate bias in data and optimally
select augmentation candidates. Our results show that WADER outperforms the
baseline model and provides a direction for mitigating data imbalance and
scarcity in text regression tasks.
Related papers
- DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with determinantal point process (DPP)
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate Longest Supported Subsequence (LSS)
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data
Generation for Cross-Lingual Learning in Tweet Intimacy Prediction [3.1798318618973362]
This paper describes the submission of UZH_CLyp for the SemEval 2023 Task 9 "Multilingual Tweet Intimacy Analysis"
We achieved second-best results in all 10 languages according to the official Pearson's correlation regression evaluation measure.
arXiv Detail & Related papers (2023-03-02T12:18:53Z) - On the Usability of Transformers-based models for a French
Question-Answering task [2.44288434255221]
This paper focuses on the usability of Transformer-based language models in small-scale learning problems.
We introduce a new compact model for French FrALBERT which proves to be competitive in low-resource settings.
arXiv Detail & Related papers (2022-07-19T09:46:15Z) - Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Improving Commonsense Causal Reasoning by Adversarial Training and Data
Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement on performance and on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource
Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z) - Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.