Utilizing Out-Domain Datasets to Enhance Multi-Task Citation Analysis
- URL: http://arxiv.org/abs/2202.10884v1
- Date: Tue, 22 Feb 2022 13:33:48 GMT
- Title: Utilizing Out-Domain Datasets to Enhance Multi-Task Citation Analysis
- Authors: Dominique Mercier, Syed Tahseen Raza Rizvi, Vikas Rajashekar, Sheraz
Ahmed, Andreas Dengel
- Abstract summary: Citation sentiment analysis suffers from both data scarcity and tremendous costs for dataset annotation.
We explore the impact of out-domain data during training to enhance the model performance.
We propose an end-to-end trainable multi-task model that covers the sentiment and intent analysis.
- Score: 4.526582372434088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Citations are generally analyzed using only quantitative measures while
excluding qualitative aspects such as sentiment and intent. However,
qualitative aspects provide deeper insights into the impact of a scientific
research artifact and make it possible to focus on relevant literature free
from bias associated with quantitative aspects. Therefore, it is possible to
rank and categorize papers based on their sentiment and intent. For this
purpose, larger citation sentiment datasets are required. However, from a time
and cost perspective, curating a large citation sentiment dataset is a
challenging task. Particularly, citation sentiment analysis suffers from both
data scarcity and tremendous costs for dataset annotation. To overcome the
bottleneck of data scarcity in the citation analysis domain, we explore the
impact of out-domain data during training to enhance model performance. Our
results emphasize choosing the data-scheduling method according to the use
case: we empirically found that a model trained with sequential data
scheduling is better suited to domain-specific use cases, whereas shuffled
data feeding achieves better performance on cross-domain tasks. Based on
these findings, we propose an end-to-end trainable multi-task model that
covers both sentiment and intent analysis and utilizes out-domain datasets
to overcome data scarcity.
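The two scheduling strategies compared in the abstract can be sketched as follows. This is a minimal illustration only: the dataset names and batch tuples are hypothetical placeholders, not details from the paper.

```python
import random

def sequential_schedule(out_domain_batches, in_domain_batches):
    """Sequential scheduling: train on all out-domain batches first,
    then finish on the scarce in-domain batches."""
    return list(out_domain_batches) + list(in_domain_batches)

def shuffled_schedule(out_domain_batches, in_domain_batches, seed=0):
    """Shuffled data feeding: interleave batches from both domains
    in random order throughout training."""
    mixed = list(out_domain_batches) + list(in_domain_batches)
    random.Random(seed).shuffle(mixed)
    return mixed

# Hypothetical batches: abundant out-domain sentiment data plus
# a small citation-sentiment set.
out_dom = [("movie-review", i) for i in range(3)]
in_dom = [("citation", i) for i in range(2)]
```

Per the paper's findings, `sequential_schedule` would suit a domain-specific (citation) use case, while `shuffled_schedule` would suit a cross-domain one.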
Related papers
- Word Matters: What Influences Domain Adaptation in Summarization? [43.7010491942323]
This paper investigates the fine-grained factors affecting domain adaptation performance.
We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization.
Our experiments conclude that, when dataset learning difficulty is considered, the cross-domain overlap and the performance gain on summarization tasks exhibit an approximately linear relationship.
arXiv Detail & Related papers (2024-06-21T02:15:49Z)
- Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model.
We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
arXiv Detail & Related papers (2024-06-07T12:39:54Z)
- Test-time Assessment of a Model's Performance on Unseen Domains via Optimal Transport [8.425690424016986]
Gauging the performance of ML models on data from unseen domains at test time is essential; this requires metrics that provide insight into a model's test-time performance.
We propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains.
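The paper's metric operates on model representations; as a hedged, minimal illustration of the underlying idea only, the one-dimensional earth mover's (optimal-transport) distance between two equal-size empirical samples can be computed directly:

```python
def wasserstein_1d(u, v):
    """1-D optimal-transport (earth mover's) distance between two
    equal-size empirical samples: the average gap between their
    sorted values."""
    assert len(u) == len(v), "sketch assumes equal-size samples"
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)
```

Identical samples yield a distance of zero; the further apart two feature distributions are, the larger the transport cost, which is what makes such a quantity usable as a domain-shift signal.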
arXiv Detail & Related papers (2024-05-02T16:35:07Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Mere Contrastive Learning for Cross-Domain Sentiment Analysis [23.350121129347556]
Cross-domain sentiment analysis aims to predict the sentiment of texts in the target domain using the model trained on the source domain.
Previous studies mostly use cross-entropy-based methods for the task, which suffer from instability and poor generalization.
We propose a modified contrastive objective with in-batch negative samples so that the sentence representations from the same class will be pushed close.
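A generic supervised contrastive loss with in-batch negatives, in the spirit of this objective, might look like the following NumPy sketch; the paper's exact formulation may differ.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull in-batch examples of the same class together and push the
    rest apart: for each anchor, maximize the softmax probability
    (over all other in-batch samples) assigned to its positives."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)
    return float((per_anchor / np.maximum(positives.sum(axis=1), 1)).mean())
```

Batches where same-class sentences already sit close in embedding space incur a lower loss than batches where classes are intermixed, which is the pressure that shapes the representations.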
arXiv Detail & Related papers (2022-08-18T07:25:55Z)
- Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives [16.68091981866261]
Unsupervised domain adaptation (UDA) is proposed to counter the performance drop on data in a target domain.
UDA has yielded promising results on natural image processing, video analysis, natural language processing, time-series data analysis, medical image analysis, etc.
arXiv Detail & Related papers (2022-08-15T20:05:07Z) - CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
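As a minimal sketch of the oversampling strategy described here (random oversampling with replacement; real pipelines typically use a library such as imbalanced-learn), assuming simple Python lists for features and labels:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, cnt in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == cls]
        for _ in range(target - cnt):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out
```

Undersampling is the mirror image: instead of duplicating minority examples, it randomly discards majority examples down to the minority count.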
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
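Instance weighting, as used in the last entry above, generally amounts to scaling each training example's loss by a per-instance weight derived from an auxiliary supervision signal. A minimal sketch of the general pattern, not the paper's exact formulation:

```python
import numpy as np

def weighted_binary_cross_entropy(probs, targets, weights):
    """Binary cross-entropy where each example's loss is scaled by a
    per-instance weight, e.g. to down-weight noisy or unreliable
    training pairs."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    losses = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float((weights * losses).sum() / weights.sum())
```

With uniform weights this reduces to the ordinary mean BCE; lowering the weight of a high-loss (likely noisy) example reduces its influence on the gradient.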
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.