Related papers: Word Matters: What Influences Domain Adaptation in Summarization?

Word Matters: What Influences Domain Adaptation in Summarization?

URL: http://arxiv.org/abs/2406.14828v1
Date: Fri, 21 Jun 2024 02:15:49 GMT
Title: Word Matters: What Influences Domain Adaptation in Summarization?
Authors: Yinghao Li, Siyu Miao, Heyan Huang, Yang Gao,
Abstract summary: This paper investigates the fine-grained factors affecting domain adaptation performance. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship.
Score: 43.7010491942323
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Domain adaptation aims to enable Large Language Models (LLMs) to generalize domain datasets unseen effectively during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of `words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, predicting a model's performance on unknown domain datasets is possible without undergoing training.

Related papers

DIDS: Domain Impact-aware Data Sampling for Large Language Model Training [41.86545248261005]
We present Domain Impact-aware Data Sampling (DIDS) to optimize domain-level sampling strategies. DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.
arXiv Detail & Related papers (2025-04-17T13:09:38Z)
Evaluating Data Influence in Meta Learning [6.757424294625179]
We propose a general data attribution evaluation framework for meta-learning within the bilevel optimization framework. This framework comprehensively models data contributions across both the inner and outer training processes.
arXiv Detail & Related papers (2025-01-27T11:14:04Z)
Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
Most Influential Subset Selection: Challenges, Promises, and Beyond [9.479235005673683]
We study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. We demonstrate that an adaptive version of theses which applies them iteratively, can effectively capture the interactions among samples.
arXiv Detail & Related papers (2024-09-25T20:00:23Z)
From Text to Treatment Effects: A Meta-Learning Approach to Handling Text-Based Confounding [7.5348062792]
This paper examines the performance of meta-learners when confounding variables are expressed in text. We show that learners using pre-trained text representations of confounders achieve improved CATE estimates. Due to the entangled nature of the text embeddings, these models do not fully match the performance of meta-learners with perfect confounder knowledge.
arXiv Detail & Related papers (2024-09-23T19:46:19Z)
Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model. We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
arXiv Detail & Related papers (2024-06-07T12:39:54Z)
Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points. We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes. Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation [62.889835139583965]
We introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data. As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data. Our experiments demonstrate that our method achieves a better performance than the current state of the art, both in real-to-real and synthetic-to-real scenarios.
arXiv Detail & Related papers (2023-04-06T17:36:23Z)
CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance. In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
Data-Centric Machine Learning in the Legal Domain [0.2624902795082451]
This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered.
arXiv Detail & Related papers (2022-01-17T23:05:14Z)
Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning. We show a statistically significant improvement on performance and on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z)
Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals. We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.