How to Determine the Most Powerful Pre-trained Language Model without
Brute Force Fine-tuning? An Empirical Survey
- URL: http://arxiv.org/abs/2312.04775v1
- Date: Fri, 8 Dec 2023 01:17:28 GMT
- Title: How to Determine the Most Powerful Pre-trained Language Model without
Brute Force Fine-tuning? An Empirical Survey
- Authors: Jun Bai, Xiaofeng Zhang, Chen Li, Hanhua Hong, Xi Xu, Chenghua Lin,
Wenge Rong
- Abstract summary: We show that H-Score generally performs well, with advantages in both effectiveness and efficiency.
We also outline remaining difficulties, namely accounting for training details, applicability to text generation, and consistency with certain metrics, which point to future directions.
- Score: 23.757740341834126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferability estimation has attracted great attention in the computer
vision field. Researchers try to estimate, at low computational cost, the
performance of a model when it is transferred from a source task to a given
target task. Given the effectiveness of such estimations, the natural language
processing community has also begun to study similar problems for the selection
of pre-trained language models. However, there is not yet a comprehensive
comparison between these estimation methods, and the differences between the
vision and language scenarios make it doubtful whether previous conclusions
carry over across fields. In this paper, we first conduct a thorough survey of
existing transferability estimation methods aimed at finding the most suitable
model, and then we conduct a detailed empirical study of the surveyed methods on
the GLUE benchmark. Through qualitative and quantitative analyses, we
demonstrate the strengths and weaknesses of existing methods and show that
H-Score generally performs well, with advantages in both effectiveness and
efficiency. We also outline remaining difficulties, namely accounting for
training details, applicability to text generation, and consistency with
certain metrics, which shed light on future directions.
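As an illustration of how lightweight such estimators can be, below is a rough sketch of an H-Score-style computation over frozen features extracted by a candidate model on the target data. It is a minimal sketch assuming centered features, class labels, and a small ridge term for numerical stability, not the survey's or the original H-Score paper's exact implementation.

```python
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray, eps: float = 1e-6) -> float:
    """Rough H-Score-style transferability estimate.

    features: (n_samples, dim) frozen representations from a pre-trained model
    labels:   (n_samples,) target-task class labels
    Higher scores suggest the representations separate the target classes well.
    """
    # Center the features so class means measure inter-class scatter.
    features = features - features.mean(axis=0, keepdims=True)
    # Overall feature covariance, with a small ridge for numerical stability.
    cov_f = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    # Covariance of the class-conditional mean features.
    classes = np.unique(labels)
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    priors = np.array([(labels == c).mean() for c in classes])
    cov_z = (priors[:, None] * class_means).T @ class_means
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_z))
```

To rank candidates, one would extract each pre-trained model's frozen features on the target data and keep the model with the highest score, avoiding brute-force fine-tuning of every candidate.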
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
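To make the idea concrete, here is a minimal sketch of ranking candidate prompts by question likelihood; the `question_log_likelihood` helper is a hypothetical stand-in for scoring text with the underlying language model, not the paper's actual procedure.

```python
from typing import Callable, Sequence

def select_prompt(
    candidate_prompts: Sequence[str],
    question: str,
    question_log_likelihood: Callable[[str, str], float],
) -> str:
    """Pick the prompt under which the model assigns the question the highest likelihood.

    question_log_likelihood(prompt, question) is assumed to return the model's
    log-probability of the question tokens conditioned on the prompt.
    """
    return max(
        candidate_prompts,
        key=lambda prompt: question_log_likelihood(prompt, question),
    )
```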
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
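For context on the conformal prediction machinery mentioned above, here is a minimal split conformal sketch under the standard exchangeability assumption; the thesis develops a non-exchangeable variant that relaxes exactly that assumption, so this is only an illustrative baseline.

```python
import numpy as np

def split_conformal_threshold(calibration_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Standard split conformal prediction: compute the nonconformity-score
    threshold that yields (1 - alpha) coverage on exchangeable data.

    calibration_scores: nonconformity scores (higher = less conforming)
    computed on a held-out calibration set.
    """
    n = len(calibration_scores)
    # Finite-sample corrected quantile level.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(calibration_scores, level, method="higher"))
```

Candidates whose nonconformity score falls at or below this threshold would form the prediction set.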
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- Training on the Test Task Confounds Evaluation and Emergence [16.32378359459614]
We show that training on the test task confounds both relative model evaluations and claims about emergent capabilities.
We propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation.
arXiv Detail & Related papers (2024-07-10T17:57:58Z)
- A step towards the integration of machine learning and small area estimation [0.0]
We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristics.
We study only small departures from the assumed model, showing that our proposal is a good alternative in this case as well.
Moreover, we propose a method for estimating the accuracy of machine learning predictors, enabling accuracy comparisons with classical methods.
arXiv Detail & Related papers (2024-02-12T09:43:17Z)
- Robust Visual Question Answering: Datasets, Methods, and Future Challenges [23.59923999144776]
Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question.
Previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers.
Various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively.
arXiv Detail & Related papers (2023-07-21T10:12:09Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify a near-optimal prompt for improving the performance of in-context learning.
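A minimal sketch of a greedy demonstration-selection loop driven by a bias metric is given below; the `bias_of` scorer is a hypothetical placeholder (for example, how far the label distribution predicted on content-free inputs is from uniform), not the paper's exact metric.

```python
from typing import Callable, List, Sequence, Tuple

def greedy_prompt_search(
    pool: Sequence[Tuple[str, str]],                      # candidate (input, label) demonstrations
    bias_of: Callable[[List[Tuple[str, str]]], float],    # lower = less predictive bias
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Greedily assemble k in-context demonstrations that minimize a bias metric."""
    chosen: List[Tuple[str, str]] = []
    remaining = list(pool)
    for _ in range(min(k, len(remaining))):
        # Pick the demonstration whose addition yields the least biased prompt.
        best = min(remaining, key=lambda demo: bias_of(chosen + [demo]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```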
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
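The noise-injection idea can be sketched as follows, assuming a PyTorch encoder whose upper layers can be called on intermediate hidden states; the layer choice, noise scale, and loss weighting are illustrative placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def noise_stability_penalty(encoder, hidden_states: torch.Tensor,
                            sigma: float = 1e-2) -> torch.Tensor:
    """Rough noise-stability regularizer in the spirit of LNSR.

    encoder: a module mapping intermediate hidden states to output
             representations (e.g., the upper layers of a fine-tuned transformer).
    hidden_states: (batch, seq_len, dim) intermediate representations.
    Penalizes how much the output moves when Gaussian noise is injected.
    """
    clean = encoder(hidden_states)
    noisy = encoder(hidden_states + sigma * torch.randn_like(hidden_states))
    return F.mse_loss(noisy, clean)

# Training sketch (weighting is a placeholder):
#   total_loss = task_loss + lambda_reg * noise_stability_penalty(encoder, hidden_states)
```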
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no additional correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
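A minimal sketch of an entropy regularizer on synthetic noise inputs, in the spirit of the summary above; the noise construction and weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def noise_entropy_loss(model, noise_batch: torch.Tensor) -> torch.Tensor:
    """Entropy regularizer on noise inputs.

    Encourages the classifier to output a near-uniform (high-entropy)
    distribution on synthetic noise, so out-of-distribution inputs are
    flagged by low confidence. noise_batch is a hypothetical batch of
    noise embeddings shaped like real inputs.
    """
    logits = model(noise_batch)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return -entropy  # minimizing this maximizes predictive entropy on noise
```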
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- Language Model Evaluation in Open-ended Text Generation [0.76146285961466]
We study different evaluation metrics that have been proposed to evaluate quality, diversity and consistency of machine-generated text.
From there, we propose a practical pipeline to evaluate language models on open-ended generation tasks.
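As an example of the kind of diversity metric such a pipeline typically includes, here is a minimal distinct-n implementation; it is a common metric for open-ended generation and is not necessarily among the exact metrics the paper studies.

```python
from typing import Iterable

def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Distinct-n: ratio of unique n-grams to total n-grams across generations.

    A common diversity metric for open-ended generation; higher means the
    model repeats itself less.
    """
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```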
arXiv Detail & Related papers (2021-08-08T06:16:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.