Abdelhak at SemEval-2024 Task 9: Decoding Brainteasers, The Efficacy of
Dedicated Models Versus ChatGPT
- URL: http://arxiv.org/abs/2403.00809v1
- Date: Sat, 24 Feb 2024 20:00:03 GMT
- Title: Abdelhak at SemEval-2024 Task 9: Decoding Brainteasers, The Efficacy of
Dedicated Models Versus ChatGPT
- Authors: Abdelhak Kelious, Mounir Okirim
- Abstract summary: This study introduces a dedicated model aimed at solving the BRAINTEASER Task 9,
a novel challenge designed to assess models' lateral thinking capabilities through sentence and word puzzles.
Our model demonstrates remarkable efficacy, securing Rank 1 in sentence puzzle solving during the test phase with an overall score of 0.98.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces a dedicated model aimed at solving the BRAINTEASER Task
9, a novel challenge designed to assess models' lateral thinking capabilities
through sentence and word puzzles. Our model demonstrates remarkable efficacy,
securing Rank 1 in sentence puzzle solving during the test phase with an
overall score of 0.98. Additionally, we explore the comparative performance of
ChatGPT, specifically analyzing how variations in temperature settings affect
its ability to engage in lateral thinking and problem-solving. Our findings
indicate a notable performance disparity between the dedicated model and
ChatGPT, underscoring the potential of specialized approaches in enhancing
creative reasoning in AI.
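As a rough illustration of the abstract's temperature analysis (not the authors' code), the sketch below queries a chat model at several temperature settings on a single brainteaser-style multiple-choice question; the model name, prompt wording, and example riddle are assumptions.

```python
# Minimal sketch of a temperature sweep on a brainteaser-style MCQ.
# Not the authors' code: model name, prompt, and riddle are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "What can you catch but never throw?"
OPTIONS = ["A) A ball", "B) A cold", "C) A fish", "D) None of the above"]

def ask(temperature: float) -> str:
    """Ask the model once at the given temperature and return its raw reply."""
    prompt = (
        "Answer the riddle by replying with a single option letter.\n"
        f"{QUESTION}\n" + "\n".join(OPTIONS)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 1.5):
        print(f"temperature={t}: {ask(t)}")
```

In practice one would sample each temperature repeatedly over the full test set and compare the resulting accuracies against the dedicated model.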
Related papers
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse [9.542503507653494]
Chain-of-thought (CoT) has become a widely used strategy for working with large language and multimodal models.
We identify characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology.
We find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance when using inference-time reasoning.
arXiv Detail & Related papers (2024-10-27T18:30:41Z) - Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models [57.582219834039506]
We introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts.
It is based on the pre-existing dense checkpoints of our Skywork-13B model.
arXiv Detail & Related papers (2024-06-03T03:58:41Z) - iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers [11.819814280565142]
This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense.
The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities.
We propose a unique strategy to improve the performance of pre-trained language models in both subtasks.
arXiv Detail & Related papers (2024-05-25T08:50:51Z) - AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning [0.0]
The SemEval-2024 BRAINTEASER task aims to test language models' capacity for divergent thinking.
We employ a holistic strategy by leveraging cutting-edge pre-trained models in a multiple-choice architecture (a minimal sketch of this setup appears after this list).
Our approach achieves 92.5% accuracy in the Sentence Puzzle subtask and 80.2% accuracy in the Word Puzzle subtask.
arXiv Detail & Related papers (2024-05-16T18:26:38Z) - AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles [1.9939549451457024]
This paper outlines our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'.
We evaluate a plethora of pre-trained transformer-based language models of different sizes through fine-tuning.
Our top-performing approaches secured competitive positions on the competition leaderboard.
arXiv Detail & Related papers (2024-04-01T12:27:55Z) - Conceptual and Unbiased Reasoning in Language Models [98.90677711523645]
We propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions.
We show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks.
We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making.
arXiv Detail & Related papers (2024-03-30T00:53:53Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data.
However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of 'decision shortcuts'.
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - Advancing Spatial Reasoning in Large Language Models: An In-Depth
Evaluation and Enhancement Using the StepGame Benchmark [4.970614891967042]
We analyze GPT's spatial reasoning performance on the StepGame benchmark.
We identify proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning.
We deploy Chain-of-Thought and Tree-of-Thoughts prompting strategies, offering insights into GPT's cognitive process.
arXiv Detail & Related papers (2024-01-08T16:13:08Z) - Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by the other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - A Causal Framework to Quantify the Robustness of Mathematical Reasoning
with Language Models [81.15974174627785]
We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space.
Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
arXiv Detail & Related papers (2022-10-21T15:12:37Z)
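Several of the related submissions above (iREL, AmazUtah_NLP, AILS-NTUA) fine-tune pre-trained transformers in a multiple-choice architecture. The sketch below shows that setup with the Hugging Face transformers library; the backbone name and riddle are illustrative assumptions, and the classification head here is untrained, whereas the cited systems fine-tune it on the BRAINTEASER training data.

```python
# Minimal sketch of a multiple-choice architecture for brainteaser-style questions.
# Assumptions: "roberta-base" backbone and an example riddle; the cited systems
# fine-tune the (here randomly initialized) classification head before scoring.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_name = "roberta-base"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)
model.eval()

question = "What can you catch but never throw?"
choices = ["A ball", "A cold", "A fish", "None of the above"]

# Pair the question with every candidate answer, then reshape to
# (batch=1, num_choices, seq_len) as the multiple-choice head expects.
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, num_choices)

print("predicted answer:", choices[logits.argmax(dim=-1).item()])
```

Training simply adds a `labels` tensor with the index of the correct option and optimizes the cross-entropy loss the model returns.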
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.