Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems
- URL: http://arxiv.org/abs/2312.08198v1
- Date: Wed, 13 Dec 2023 15:03:27 GMT
- Title: Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems
- Authors: Kamil Kanclerz, Julita Bielaniewicz, Marcin Gruza, Jan Kocon, Stanisław Woźniak, Przemysław Kazienko
- Abstract summary: We propose a new model-based approach that selects, individually for each text, the tasks to be annotated in a multi-task scenario.
Experiments carried out on three datasets, dozens of NLP tasks, and thousands of annotations show that our method allows up to a 40% reduction in the number of annotations with negligible loss of knowledge.
- Score: 12.38430125789305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data annotated by humans is a source of knowledge: it describes the
peculiarities of the problem and thereby fuels the decision process of the
trained model. Unfortunately, the annotation process for subjective natural
language processing (NLP) problems such as offensiveness or emotion detection is
often very expensive and time-consuming. One of the inevitable risks is
spending part of the funds and annotator effort on annotations that provide no
additional knowledge about the specific task. To minimize these costs, we
propose a new model-based approach that selects, individually for each text,
the tasks to be annotated in a multi-task scenario. Experiments carried
out on three datasets, dozens of NLP tasks, and thousands of annotations show
that our method allows up to a 40% reduction in the number of annotations with
negligible loss of knowledge. The results also emphasize that the amount of
diverse data required to train a model efficiently depends on the subjectivity
of the annotation task. We also measured the relations between subjective tasks
by evaluating the model in single-task and multi-task scenarios. Moreover, for
some datasets, training only on the labels predicted by our model improved the
efficiency of task selection, acting as a self-supervised learning
regularization technique.
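Since the abstract gives no implementation details, the following is only a minimal, hypothetical sketch of the general idea of model-based annotation-task selection: a multi-task model scores each (text, task) pair, and human annotation is requested only where the model is still uncertain. The task names, the entropy threshold, and the `predict_task_probs` stand-in are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: per-text selection of annotation tasks in a multi-task
# setting. A real system would use a trained multi-task model; here a random
# stand-in (predict_task_probs) keeps the example self-contained.
import numpy as np

TASKS = ["offensiveness", "sarcasm", "humor"]  # illustrative subjective tasks


def entropy(p: np.ndarray) -> float:
    """Shannon entropy (natural log) of a class-probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


def predict_task_probs(text: str) -> dict:
    """Stand-in for a multi-task model returning class probabilities per task."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return {task: rng.dirichlet(np.ones(2)) for task in TASKS}


def select_tasks_to_annotate(text: str, entropy_threshold: float = 0.5) -> list:
    """Request human labels only for tasks the model is still uncertain about;
    confident (low-entropy) predictions are skipped to save annotation budget."""
    probs = predict_task_probs(text)
    return [task for task, p in probs.items() if entropy(p) >= entropy_threshold]


if __name__ == "__main__":
    for text in ["You are such a genius...", "Have a nice day!"]:
        print(text, "->", select_tasks_to_annotate(text))
```

Under a scheme of this kind, the threshold controls the trade-off between annotation cost and the knowledge retained by the model.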
Related papers
- Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the orders of all the multi-task data for training.
At the task level, we aim to find the optimal task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of all instances per task, then divide them into easy-to-difficult mini-batches for training.
arXiv Detail & Related papers (2024-01-07T18:12:20Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful with classification tasks that have few, or even non-overlapping, annotations.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Exploring intra-task relations to improve meta-learning algorithms [1.223779595809275]
We aim to exploit external knowledge of task relations to improve training stability via effective mini-batching of tasks.
We hypothesize that selecting a diverse set of tasks in a mini-batch will lead to a better estimate of the full gradient and hence will lead to a reduction of noise in training.
arXiv Detail & Related papers (2023-12-27T15:33:52Z)
- Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks [17.879087904904935]
Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model.
As systems evolve over time, adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks.
In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of a model already trained on n tasks into a new one that solves n+1 tasks (a minimal sketch of such a distillation loss follows after this list).
arXiv Detail & Related papers (2023-02-22T00:18:25Z)
- Multi-task Bias-Variance Trade-off Through Functional Constraints [102.64082402388192]
Multi-task learning aims to acquire a set of functions that perform well for diverse tasks.
In this paper we draw intuition from the two extreme learning scenarios -- a single function for all tasks, and a task-specific function that ignores the other tasks.
We introduce a constrained learning formulation that enforces domain specific solutions to a central function.
arXiv Detail & Related papers (2022-10-27T16:06:47Z)
- Multi-task Active Learning for Pre-trained Transformer-based Models [22.228551277598804]
Multi-task learning, in which several tasks are jointly learned by a single model, allows NLP models to share information from multiple annotations.
This technique requires annotating the same text with multiple annotation schemes, which may be costly and laborious.
Active learning (AL) has been demonstrated to optimize annotation processes by iteratively selecting unlabeled examples.
arXiv Detail & Related papers (2022-08-10T14:54:13Z)
- KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Few-Shot NLP [68.43279384561352]
Existing data augmentation algorithms leverage task-independent rules or fine-tune general-purpose pre-trained language models.
These methods carry only trivial task-specific knowledge and are thus limited to yielding low-quality synthetic data for weak baselines on simple tasks.
We propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pretrained on a mixture of diverse NLP tasks.
arXiv Detail & Related papers (2022-06-21T11:34:02Z)
- Weighted Training for Cross-Task Learning [71.94908559469475]
We introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning.
We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees.
As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning.
arXiv Detail & Related papers (2021-05-28T20:27:02Z)
- Label-Efficient Multi-Task Segmentation using Contrastive Learning [0.966840768820136]
We propose a multi-task segmentation model with a contrastive learning based subtask and compare its performance with other multi-task models.
We experimentally show that our proposed method outperforms other multi-task methods including the state-of-the-art fully supervised model when the amount of annotated data is limited.
arXiv Detail & Related papers (2020-09-23T14:12:17Z)
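To make the continual-learning entry above (Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks) more concrete, here is a small, hedged sketch of a distillation-style loss for expanding a multi-task model from n to n+1 tasks. The function names, the NumPy stand-ins for model logits, and the alpha weighting are illustrative assumptions, not that paper's actual implementation.

```python
# Hypothetical sketch of distillation-based continual multi-task learning: when
# adding task n+1, the new model is trained to match the old model's predictions
# on tasks 1..n (distillation) while learning task n+1 from human labels.
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Mean KL(p || q) over a batch of probability vectors."""
    p, q = np.clip(p, 1e-12, 1.0), np.clip(q, 1e-12, 1.0)
    return float((p * np.log(p / q)).sum(axis=-1).mean())


def cross_entropy(labels: np.ndarray, q: np.ndarray) -> float:
    """Mean negative log-likelihood of integer labels under probabilities q."""
    q = np.clip(q, 1e-12, 1.0)
    return float(-np.log(q[np.arange(len(labels)), labels]).mean())


def continual_mtl_loss(old_logits, new_logits_old_tasks, new_logits_new_task,
                       new_task_labels, alpha: float = 0.5) -> float:
    """Distillation on the n old task heads plus supervised loss on the new head."""
    distill = np.mean([
        kl_divergence(softmax(teacher), softmax(student))
        for teacher, student in zip(old_logits, new_logits_old_tasks)
    ])
    supervised = cross_entropy(new_task_labels, softmax(new_logits_new_task))
    return alpha * distill + (1 - alpha) * supervised


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    old = [rng.normal(size=(4, 3)) for _ in range(2)]       # teacher logits, 2 old tasks
    new_old = [rng.normal(size=(4, 3)) for _ in range(2)]   # student logits on old tasks
    new_task = rng.normal(size=(4, 2))                      # student logits on the new task
    labels = rng.integers(0, 2, size=4)                     # labels for the new task
    print(continual_mtl_loss(old, new_old, new_task, labels))
```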
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.