Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
- URL: http://arxiv.org/abs/2310.05442v2
- Date: Mon, 23 Oct 2023 14:43:40 GMT
- Title: Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
- Authors: Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber,
Barbara Plank
- Abstract summary: We argue that it is time to rethink what constitutes tasks and model evaluation in NLP.
We review existing compartmentalized approaches for understanding the origins of a model's functional capacity.
- Score: 36.329415036660535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language understanding is a multi-faceted cognitive capability, which the
Natural Language Processing (NLP) community has striven to model
computationally for decades. Traditionally, facets of linguistic intelligence
have been compartmentalized into tasks with specialized model architectures and
corresponding evaluation protocols. With the advent of large language models
(LLMs) the community has witnessed a dramatic shift towards general purpose,
task-agnostic approaches powered by generative models. As a consequence, the
traditional compartmentalized notion of language tasks is breaking down,
followed by an increasing challenge for evaluation and analysis. At the same
time, LLMs are being deployed in more real-world scenarios, including
previously unforeseen zero-shot setups, increasing the need for trustworthy and
reliable systems. Therefore, we argue that it is time to rethink what
constitutes tasks and model evaluation in NLP, and pursue a more holistic view
on language, placing trustworthiness at the center. Towards this goal, we
review existing compartmentalized approaches for understanding the origins of a
model's functional capacity, and provide recommendations for more multi-faceted
evaluation protocols.
Related papers
- Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We observe that retrieval-augmented language models have the inherent capability of supplying responses according to both contextual and parametric knowledge.
Inspired by aligning language models with human preferences, we take a first step towards aligning retrieval-augmented language models to a state where they respond relying solely on external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- Collective Constitutional AI: Aligning a Language Model with Public Input [20.95333081841239]
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior.
We present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs.
We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input.
arXiv Detail & Related papers (2024-06-12T02:20:46Z)
- Scalable Language Model with Generalized Continual Learning [58.700439919096155]
Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs [10.453404263936335]
We explore an alternative dialectical evaluation of language models for commonsense reasoning.
The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system.
In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning.
arXiv Detail & Related papers (2023-04-22T06:28:46Z)
- VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena [15.984927623688915]
VALSE (Vision And Language Structured Evaluation) is a novel benchmark for testing general-purpose pretrained vision and language (V&L) models.
VALSE offers a suite of six tests covering various linguistic constructs.
We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models.
arXiv Detail & Related papers (2021-12-14T17:15:04Z)