Word Matters: What Influences Domain Adaptation in Summarization?
- URL: http://arxiv.org/abs/2406.14828v1
- Date: Fri, 21 Jun 2024 02:15:49 GMT
- Title: Word Matters: What Influences Domain Adaptation in Summarization?
- Authors: Yinghao Li, Siyu Miao, Heyan Huang, Yang Gao
- Abstract summary: This paper investigates the fine-grained factors affecting domain adaptation performance.
We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization.
Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship.
- Score: 43.7010491942323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain adaptation aims to enable Large Language Models (LLMs) to generalize effectively to domain datasets unseen during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of 'words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, it is possible to predict a model's performance on unknown domain datasets without training on them.
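The abstract names three word-level quantities (compression rate, abstraction level, and cross-domain overlap) without giving their formulas. As a rough illustration only, the sketch below uses common stand-in definitions that are assumptions, not the paper's: compression rate as source words per summary word, abstraction level as the share of summary words absent from the source, and overlap as the share of the target vocabulary also seen in the source domain. The function names and the simple regex tokenizer are hypothetical.

```python
import re


def words(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_rate(document, summary):
    """Assumed definition: number of source words per summary word."""
    summary_words = words(summary)
    if not summary_words:
        raise ValueError("summary contains no words")
    return len(words(document)) / len(summary_words)


def abstraction_level(document, summary):
    """Assumed definition: fraction of summary words not found in the source."""
    document_vocab = set(words(document))
    summary_words = words(summary)
    if not summary_words:
        raise ValueError("summary contains no words")
    novel = [w for w in summary_words if w not in document_vocab]
    return len(novel) / len(summary_words)


def vocabulary_overlap(source_texts, target_texts):
    """Assumed definition: share of target-domain vocabulary seen in the source domain."""
    source_vocab = {w for text in source_texts for w in words(text)}
    target_vocab = {w for text in target_texts for w in words(text)}
    if not target_vocab:
        raise ValueError("target texts contain no words")
    return len(source_vocab & target_vocab) / len(target_vocab)


if __name__ == "__main__":
    doc = "the cat sat on the mat and looked at the dog"
    summ = "a cat watched the dog"
    print(compression_rate(doc, summ))
    print(abstraction_level(doc, summ))
    print(vocabulary_overlap(["the cat sat"], ["the dog sat down"]))
```

These per-example values could be averaged over a dataset to obtain dataset-level indicators; how the paper combines them into a learning-difficulty score is not stated in the abstract and is left out here.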
Related papers
- Evaluating Data Influence in Meta Learning [6.757424294625179]
We propose a general data attribution evaluation framework for meta-learning within the bilevel optimization framework.
This framework comprehensively models data contributions across both the inner and outer training processes.
arXiv Detail & Related papers (2025-01-27T11:14:04Z)
- Quantifying the Importance of Data Alignment in Downstream Model Performance [1.2564343689544843]
We use the Task2Vec-based alignment coefficient to quantify the impact of alignment between training data and evaluation data on downstream performance.
We find a strong, predictable negative correlation between the alignment coefficient of a model's training and evaluation data and the model's loss/perplexity on the respective downstream task.
arXiv Detail & Related papers (2025-01-14T23:59:23Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- Most Influential Subset Selection: Challenges, Promises, and Beyond [9.479235005673683]
We study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence.
We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses.
We demonstrate that an adaptive version of these approaches, which applies them iteratively, can effectively capture the interactions among samples.
arXiv Detail & Related papers (2024-09-25T20:00:23Z)
- Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model.
We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
arXiv Detail & Related papers (2024-06-07T12:39:54Z)
- SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation [62.889835139583965]
We introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data.
As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data.
Our experiments demonstrate that our method achieves a better performance than the current state of the art, both in real-to-real and synthetic-to-real scenarios.
arXiv Detail & Related papers (2023-04-06T17:36:23Z)
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Data-Centric Machine Learning in the Legal Domain [0.2624902795082451]
This paper explores how changes in a data set influence the measured performance of a model.
Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance.
The observed effects are surprisingly pronounced, especially when the per-class performance is considered.
arXiv Detail & Related papers (2022-01-17T23:05:14Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.