Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset
- URL: http://arxiv.org/abs/2502.15810v1
- Date: Wed, 19 Feb 2025 12:40:49 GMT
- Title: Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset
- Authors: Rawand Alfugaha, Mohammad AL-Smadi,
- Abstract summary: This study evaluates the performance of Large Language Models (LLMs) on SemEval-2020 Task 4 dataset.<n>The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation)<n>Results indicate that larger models outperform previous models, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A whereas, lagging behind previous models with 93.40% in Task B.
- Score: 0.16385815610837165
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study evaluates the performance of Large Language Models (LLMs) on SemEval-2020 Task 4 dataset, focusing on commonsense validation and explanation. Our methodology involves evaluating multiple LLMs, including LLaMA3-70B, Gemma2-9B, and Mixtral-8x7B, using zero-shot prompting techniques. The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation), where models identify the reasoning behind implausible statements. Performance is assessed based on accuracy, and results are compared to fine-tuned transformer-based models. The results indicate that larger models outperform previous models and perform closely to human evaluation for Task A, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A whereas, lagging behind previous models with 93.40% in Task B. However, while models effectively identify implausible statements, they face challenges in selecting the most relevant explanation, highlighting limitations in causal and inferential reasoning.
Related papers
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning [76.82159851648711]
We propose a framework that dynamically improves the embedding model's representation learning for negative pairs.
LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance.
LLaVE can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance.
arXiv Detail & Related papers (2025-03-04T10:21:57Z) - Error-driven Data-efficient Large Multimodal Model Tuning [35.20400815089843]
Large Multimodal Models (LMMs) have demonstrated impressive performance across numerous academic benchmarks.<n>We propose an error-driven data-efficient tuning framework that aims to efficiently adapt generic LMMs to newly emerging tasks.
arXiv Detail & Related papers (2024-12-20T08:07:11Z) - Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language
Model for Reading Comprehension of Abstract Meaning [16.151203366447962]
We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model.
Inspired by the similarity of the ReCAM task and the language pre-training, we propose a simple yet effective technology, namely, negative augmentation with language model.
Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively.
arXiv Detail & Related papers (2021-02-25T13:03:05Z) - When Can Models Learn From Explanations? A Formal Framework for
Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z) - Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual
Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z) - QiaoNing at SemEval-2020 Task 4: Commonsense Validation and Explanation
system based on ensemble of language model [2.728575246952532]
In this paper, we present language model system submitted to SemEval-2020 Task 4 competition: "Commonsense Validation and Explanation"
We implemented with transfer learning using pretrained language models (BERT, XLNet, RoBERTa, and ALBERT) and fine-tune them on this task.
The ensembled model better solves this problem, making the model's accuracy reached 95.9% on subtask A, which just worse than human's by only 3% accuracy.
arXiv Detail & Related papers (2020-09-06T05:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.