Evidence > Intuition: Transferability Estimation for Encoder Selection
- URL: http://arxiv.org/abs/2210.11255v1
- Date: Thu, 20 Oct 2022 13:25:21 GMT
- Title: Evidence > Intuition: Transferability Estimation for Encoder Selection
- Authors: Elisa Bassignana, Max Müller-Eberstein, Mike Zhang, and Barbara Plank
- Abstract summary: We generate quantitative evidence to predict which LM will perform best on a target task without having to fine-tune all candidates.
We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of setups.
- Score: 16.490047604583882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increase in availability of large pre-trained language models (LMs)
in Natural Language Processing (NLP), it becomes critical to assess their fit
for a specific target task a priori - as fine-tuning the entire space of
available LMs is computationally prohibitive and unsustainable. However,
encoder transferability estimation has received little to no attention in NLP.
In this paper, we propose to generate quantitative evidence to predict which
LM, out of a pool of models, will perform best on a target task without having
to fine-tune all candidates. We provide a comprehensive study on LM ranking for
10 NLP tasks spanning the two fundamental problem types of classification and
structured prediction. We adopt the state-of-the-art Logarithm of Maximum
Evidence (LogME) measure from Computer Vision (CV) and find that it positively
correlates with final LM performance in 94% of the setups. In the first study
of its kind, we further compare transferability measures with the de facto
standard of human practitioner ranking, finding that evidence from quantitative
metrics is more robust than pure intuition and can help identify unexpected LM
candidates.
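As an illustration of what "positively correlates with final LM performance" means in practice, the sketch below ranks a pool of encoders by a transferability score and checks how well that ranking agrees with their fine-tuned results. It is a minimal sketch, not the authors' evaluation code: the encoder names and all numbers are hypothetical placeholders, the helper names `kendall_tau` and `rank_encoders` are invented for this example, and plain Kendall's tau is assumed here as the agreement statistic. The transferability scores themselves (for example, LogME values) are assumed to have been computed elsewhere.

```python
from itertools import combinations


def kendall_tau(scores, performances):
    """Kendall's tau-a between two equally long sequences (no tie correction)."""
    if len(scores) != len(performances):
        raise ValueError("both sequences must have the same length")
    if len(scores) < 2:
        raise ValueError("need at least two encoders to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        # Positive product: the pair is ordered the same way by both measures.
        direction = (scores[i] - scores[j]) * (performances[i] - performances[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_encoders(score_by_encoder):
    """Return encoder names sorted from highest to lowest score."""
    return sorted(score_by_encoder, key=score_by_encoder.get, reverse=True)


# Hypothetical placeholder values, NOT results from the paper.
estimated = {"encoder_a": 0.62, "encoder_b": 0.48, "encoder_c": 0.55, "encoder_d": 0.30}
finetuned = {"encoder_a": 88.1, "encoder_b": 84.0, "encoder_c": 83.2, "encoder_d": 79.5}

names = list(estimated)
tau = kendall_tau([estimated[n] for n in names], [finetuned[n] for n in names])

print("Ranking by estimated transferability:", rank_encoders(estimated))
print("Ranking by fine-tuned performance:  ", rank_encoders(finetuned))
print(f"Kendall's tau: {tau:.3f}")
```

A tau above zero means the cheap score orders most encoder pairs the same way fine-tuning does. In the placeholder data, only one of the six pairs (encoder_b and encoder_c) is ordered differently.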
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z) - Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data [3.376269351435396]
We develop a formal perspective on probing using structural causal models (SCMs).
We extend a recent study of LMs in the context of a synthetic grid-world navigation task.
Our techniques provide robust empirical evidence for the ability of LMs to induce the latent concepts underlying text.
arXiv Detail & Related papers (2024-07-18T17:59:27Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z) - Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models [37.172662930947446]
Language models (LMs) are potentially vulnerable to extraction attacks, which represent a significant privacy risk.
We propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM.
POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin.
arXiv Detail & Related papers (2024-06-20T08:12:49Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought [51.240387516059535]
We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., 1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks.
We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals.
arXiv Detail & Related papers (2024-04-04T12:46:37Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - An Interpretability Evaluation Benchmark for Pre-trained Language Models [37.16893581395874]
We propose a novel evaluation benchmark providing both English and Chinese annotated data.
It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning and computation.
It contains perturbed instances for each original instance, so as to use the rationale consistency under perturbations as the metric for faithfulness.
arXiv Detail & Related papers (2022-07-28T08:28:09Z) - Sort by Structure: Language Model Ranking as Dependency Probing [25.723591566201343]
Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored.
We propose probing to rank LMs, specifically for parsing dependencies in a given language, by measuring the degree to which labeled trees are recoverable from an LM's contextualized embeddings.
Across 46 typologically and architecturally diverse LM-language pairs, our approach predicts the best LM choice 79% of the time using orders of magnitude less compute than training a full parser.
arXiv Detail & Related papers (2022-06-10T08:10:29Z) - oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition.
A fundamental challenge is to understand whether the performance of an LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.