Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses
- URL: http://arxiv.org/abs/2305.11662v3
- Date: Wed, 20 Dec 2023 12:57:34 GMT
- Title: Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses
- Authors: Xenia Ohmer, Elia Bruni, Dieuwke Hupkes
- Abstract summary: We propose a novel paradigm for evaluating large language models (LLMs).
We measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself.
Our approach does not require any static evaluation corpora in languages other than English.
- Score: 14.784624121891328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: At the staggering pace with which the capabilities of large language models
(LLMs) are increasing, creating future-proof evaluation sets to assess their
understanding becomes more and more challenging. In this paper, we propose a
novel paradigm for evaluating LLMs which leverages the idea that correct world
understanding should be consistent across different (Fregean) senses of the
same meaning. Accordingly, we measure understanding not in terms of correctness
but by evaluating consistency across multiple senses that are generated by the
model itself. We showcase our approach by instantiating a test where the
different senses are different languages, hence using multilingual
self-consistency as a litmus test for the model's understanding and
simultaneously addressing the important topic of multilinguality. Taking one of
the latest versions of ChatGPT as our object of study, we evaluate multilingual
consistency for two different tasks across three different languages. We show
that its multilingual consistency is still lacking, and that its task and world
understanding are thus not language-independent. As our approach does not
require any static evaluation corpora in languages other than English, it can
easily and cheaply be extended to different languages and tasks and could
become an integral part of future benchmarking efforts.
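Below is a minimal sketch of how such a multilingual self-consistency test could be instantiated, following the paradigm described in the abstract. It assumes a hypothetical `ask_llm(prompt)` wrapper around any chat-style LLM API; the translation prompts, the language list, and the exact-match agreement metric are illustrative simplifications, not the authors' exact protocol.

```python
# Sketch: multilingual self-consistency without gold labels in non-English languages.
# `ask_llm` is a placeholder for whatever LLM client is in use.
from itertools import combinations
from typing import Callable, Sequence


def multilingual_consistency(
    ask_llm: Callable[[str], str],
    english_task: str,
    languages: Sequence[str] = ("German", "Dutch", "Italian"),  # illustrative choice
) -> float:
    """Return the fraction of sense pairs on which the model's answers agree."""
    # 1. Let the model itself generate the other "senses" of the task
    #    by translating the English instance.
    prompts = {"English": english_task}
    for lang in languages:
        prompts[lang] = ask_llm(
            f"Translate the following task into {lang}, preserving its meaning:\n"
            f"{english_task}"
        )

    # 2. Solve the task in every language.
    answers = {lang: ask_llm(prompt).strip().lower() for lang, prompt in prompts.items()}

    # 3. Consistency = pairwise agreement across senses; correctness is never checked.
    pairs = list(combinations(answers.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs)
```

Because no reference answers are needed outside English, the score measures only agreement across senses, which is precisely the property the paper proposes as a proxy for language-independent task understanding.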
Related papers
- Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models [16.942897938964638]
Large Language Models (LLMs) have shown exceptional performance in various Natural Language Processing (NLP) tasks.
Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages.
This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs.
arXiv Detail & Related papers (2024-07-01T15:11:37Z) - From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency [13.154753046052527]
We focus on consistency across languages as well as paraphrases.
We find that the model's multisense consistency is lacking and run several follow-up analyses to verify this finding.
We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like.
arXiv Detail & Related papers (2024-04-18T12:48:17Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Language Modelling as a Multi-Task Problem [12.48699285085636]
We investigate whether language models adhere to learning principles of multi-task learning during training.
Experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling.
arXiv Detail & Related papers (2021-01-27T09:47:42Z) - Meta-Learning for Effective Multi-task and Multilingual Modelling [23.53779501937046]
We propose a meta-learning approach to learn the interactions between both tasks and languages.
We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset.
arXiv Detail & Related papers (2021-01-25T19:30:26Z) - XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)