Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models
- URL: http://arxiv.org/abs/2406.15468v1
- Date: Sat, 15 Jun 2024 05:35:47 GMT
- Title: Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models
- Authors: Wentian Wang, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang,
- Abstract summary: We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs)
We modified standardized test questions by replacing a key term with a dummy word along with its definition.
We found a substantial reduction in model performance after such replacement, suggesting poor comprehension.
- Score: 3.6596632313607658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that ``truly'' understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new benchmark provides a rigorous benchmark for testing true model comprehension, and poses a challenge to the broader scientific community.
Related papers
- Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models [10.482557806309174]
We introduce LexBench, a comprehensive evaluation suite enabled to test language models (LMs) on semantic phrase processing tasks.
Thanks to ourbenchmark, we assess the performance of 15 LMs across model architectures and parameter scales in classification, extraction, and interpretation tasks.
Our benchmarking findings can serve future research aiming to improve the generic capability of LMs on semantic phrase comprehension.
arXiv Detail & Related papers (2024-05-05T09:20:38Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - SCREWS: A Modular Framework for Reasoning with Revisions [58.698199183147935]
We present SCREWS, a modular framework for reasoning with revisions.
We show that SCREWS unifies several previous approaches under a common framework.
We evaluate our framework with state-of-the-art LLMs on a diverse set of reasoning tasks.
arXiv Detail & Related papers (2023-09-20T15:59:54Z) - Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models [27.491408293411734]
Large Language Models (LLMs) show promising results in language generation and instruction following but frequently "hallucinate"
Our research introduces a simple redundancy: not all tokens in auto-regressive text equally represent the underlying meaning.
arXiv Detail & Related papers (2023-07-03T22:17:16Z) - Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z) - LMentry: A Language Model Benchmark of Elementary Language Tasks [39.71352171304755]
LMentry is a benchmark that focuses on a compact set of tasks that are trivial to humans.
It provides insights into the capabilities and robustness of large language models.
arXiv Detail & Related papers (2022-11-03T18:01:12Z) - WinoDict: Probing language models for in-context word acquisition [32.81587292382359]
We introduce a new in-context learning paradigm to measure Large Language Models' (LLMs) ability to learn novel words during inference.
We show that the accuracy of LLMs compared to the original Winograd tasks decreases radically in our benchmark.
arXiv Detail & Related papers (2022-09-25T05:30:13Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z) - A Brief Survey and Comparative Study of Recent Development of Pronoun
Coreference Resolution [55.39835612617972]
Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to.
As one important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks and still challenging for existing models.
We conduct extensive experiments to show that even though current models are achieving good performance on the standard evaluation set, they are still not ready to be used in real applications.
arXiv Detail & Related papers (2020-09-27T01:40:01Z) - FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity
Attention Layer based on BERT [0.5772546394254112]
We develop a model based on BERT, a state-of-the-art transformer network.
We are ranked first in the leader board with test accuracy of 87.79%.
arXiv Detail & Related papers (2020-08-22T08:04:21Z) - ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.