Dialectical language model evaluation: An initial appraisal of the
commonsense spatial reasoning abilities of LLMs
- URL: http://arxiv.org/abs/2304.11164v1
- Date: Sat, 22 Apr 2023 06:28:46 GMT
- Title: Dialectical language model evaluation: An initial appraisal of the
commonsense spatial reasoning abilities of LLMs
- Authors: Anthony G Cohn, Jose Hernandez-Orallo
- Abstract summary: We explore an alternative dialectical evaluation of language models for commonsense reasoning.
The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system.
In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning.
- Score: 10.453404263936335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
Related papers
- Reasoning Elicitation in Language Models via Counterfactual Feedback [17.908819732623716]
We derive novel metrics that balance accuracy in factual and counterfactual questions.
We propose several fine-tuning approaches that aim to elicit better reasoning mechanisms.
We evaluate the performance of the fine-tuned language models in a variety of realistic scenarios.
arXiv Detail & Related papers (2024-10-02T15:33:30Z) - Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z) - Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z) - Pseudointelligence: A Unifying Framework for Language Model Evaluation [14.95543156914676]
We propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator.
We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
arXiv Detail & Related papers (2023-10-18T17:48:05Z) - Establishing Trustworthiness: Rethinking Tasks and Model Evaluation [36.329415036660535]
We argue that it is time to rethink what constitutes tasks and model evaluation in NLP.
We review existing compartmentalized approaches for understanding the origins of a model's functional capacity.
arXiv Detail & Related papers (2023-10-09T06:32:10Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Improving Factuality and Reasoning in Language Models through Multiagent
Debate [95.10641301155232]
We present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer.
Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks.
Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate.
arXiv Detail & Related papers (2023-05-23T17:55:11Z) - Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding [1.827510863075184]
Curriculum is a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
arXiv Detail & Related papers (2022-04-13T10:32:03Z) - Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
intrinsic evaluation for embeddings lags far behind, and there has been no significant update since the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z) - On the Use of Linguistic Features for the Evaluation of Generative
Dialogue Systems [17.749995931459136]
We propose that a metric based on linguistic features may be able to maintain good correlation with human judgment and be interpretable.
To support this proposition, we measure and analyze various linguistic features on dialogues produced by multiple dialogue models.
We find that the features' behaviour is consistent with the known properties of the models tested, and is similar across domains.
arXiv Detail & Related papers (2021-04-13T16:28:00Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.