AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles
- URL: http://arxiv.org/abs/2404.01084v1
- Date: Mon, 1 Apr 2024 12:27:55 GMT
- Title: AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles
- Authors: Ioannis Panagiotopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
- Abstract summary: This paper outlines our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'.
We evaluate a plethora of pre-trained transformer-based language models of different sizes through fine-tuning.
Our top-performing approaches secured competitive positions on the competition leaderboard.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we outline our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'. We engage in both sub-tasks: Sub-task A-Sentence Puzzle and Sub-task B-Word Puzzle. We evaluate a plethora of pre-trained transformer-based language models of different sizes through fine-tuning. Subsequently, we undertake an analysis of their scores and responses to aid future researchers in understanding and utilizing these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy score of 81.7% in the Sentence Puzzle, and 85.4% in the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30% respectively.
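As a rough illustration of the fine-tuning recipe described in the abstract, the sketch below wraps a pre-trained transformer in a multiple-choice head that jointly scores each (puzzle, candidate-answer) pair. The model name, example puzzle, and single training step are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: fine-tuning a pre-trained transformer on a
# multiple-choice brain teaser with the Hugging Face transformers API.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_name = "roberta-large"  # assumed; the paper evaluates several model sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# One brain-teaser-style item: a question paired with four candidate answers.
question = "A man shaves several times a day, yet he still has a beard. How?"
choices = [
    "He is a barber.",
    "He uses a broken razor.",
    "He shaves only at night.",
    "None of the above.",
]

# Encode each (question, choice) pair, then stack to shape
# (batch=1, num_choices=4, seq_len), as the multiple-choice head expects.
enc = tokenizer([question] * len(choices), choices,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
labels = torch.tensor([0])  # index of the gold answer

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one fine-tuning step; wrap in an optimizer loop
print("predicted:", choices[outputs.logits.argmax(-1).item()])
```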
Related papers
- Towards Open-World Mobile Manipulation in Homes: Lessons from the NeurIPS 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge
We propose Open Vocabulary Mobile Manipulation as a key benchmark task for robotics.
We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task.
We detail the results and methodologies used, both in simulation and real-world settings.
arXiv Detail & Related papers (2024-07-09T15:15:01Z)
- BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense
This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense.
The dataset comprises multiple-choice questions that challenge models to think "outside of the box".
Our best method achieves an overall accuracy of 85% on the Sentence Puzzle subtask.
arXiv Detail & Related papers (2024-06-07T14:01:56Z)
- iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers
This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense.
The BRAINTEASER task comprises multiple-choice question answering designed to evaluate models' lateral thinking capabilities.
We propose a unique strategy to improve the performance of pre-trained language models in both subtasks.
arXiv Detail & Related papers (2024-05-25T08:50:51Z)
- AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning
The SemEval-2024 BRAINTEASER task aims to test language models' capacity for divergent thinking.
We employ a holistic strategy by leveraging cutting-edge pre-trained models in a multiple-choice architecture.
Our approach achieves 92.5% accuracy on the Sentence Puzzle subtask and 80.2% accuracy on the Word Puzzle subtask.
arXiv Detail & Related papers (2024-05-16T18:26:38Z)
- Abdelhak at SemEval-2024 Task 9: Decoding Brainteasers, The Efficacy of Dedicated Models Versus ChatGPT
This study introduces a dedicated model aimed at solving SemEval-2024 Task 9 (BRAINTEASER), a novel challenge designed to assess models' lateral thinking capabilities through sentence and word puzzles.
Our model demonstrates remarkable efficacy, securing Rank 1 in sentence puzzle solving during the test phase with an overall score of 0.98.
arXiv Detail & Related papers (2024-02-24T20:00:03Z)
- Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task
This paper focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation.
We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting.
Our work reveals that combining these approaches using a "small", open-source model (orca_mini_v3_7B) yields competitive results.
arXiv Detail & Related papers (2023-11-01T17:44:35Z)
- Bag of Tricks for Effective Language Model Pretraining and Downstream Adaptation: A Case Study on GLUE
This report briefly describes our submission, Vega v1, to the General Language Understanding Evaluation (GLUE) leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5
We compare sequential fine-tuning with a multi-task learning model in the context of boosting performance on two tasks.
Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting.
arXiv Detail & Related papers (2022-10-31T13:26:08Z)
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
We focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH).
We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex to surpass the average human-rater performance on 17 of the 23 tasks.
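As a minimal sketch of what chain-of-thought prompting looks like in practice, the snippet below prepends a worked exemplar whose answer spells out its intermediate reasoning; the exemplar and test item are illustrative, not BBH's official CoT prompts.

```python
# Hedged sketch of a chain-of-thought (CoT) prompt for a BBH-style item.
# The few-shot exemplar below is illustrative; BBH ships its own CoT prompts.
exemplar = (
    "Q: I have two apples, buy three more, then eat one. How many remain?\n"
    "A: Let's think step by step. 2 + 3 = 5 apples; eating one leaves\n"
    "5 - 1 = 4. The answer is 4.\n\n"
)
question = "Q: Take the last letters of the words in 'Lady Gaga' and concatenate them.\n"
prompt = exemplar + question + "A: Let's think step by step."

# `prompt` would then be sent to a model such as PaLM or Codex; the final
# answer is parsed from the text following "The answer is".
print(prompt)
```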
arXiv Detail & Related papers (2022-10-17T17:08:26Z)
- Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks.
Rather than mandating the use of LfHF techniques, we described four tasks in natural language to be accomplished in the video game Minecraft.
Teams developed a diverse range of LfHF algorithms across a variety of possible human feedback types.
arXiv Detail & Related papers (2022-04-14T17:24:54Z)
- Reed at SemEval-2020 Task 9: Fine-Tuning and Bag-of-Words Approaches to Code-Mixed Sentiment Analysis
We explore the task of sentiment analysis on Hinglish (code-mixed Hindi-English) tweets as participants of Task 9 of the SemEval-2020 competition, known as the SentiMix task.
We had two main approaches: 1) applying transfer learning by fine-tuning pre-trained BERT models and 2) training feedforward neural networks on bag-of-words representations (a minimal sketch of the latter appears after this entry).
During the evaluation phase of the competition, we obtained an F-score of 71.3% with our best model, which placed 4th out of 62 entries in the official system rankings.
arXiv Detail & Related papers (2020-07-26T05:48:46Z)
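Referring back to the Reed entry above, here is a minimal sketch of its second approach: a small feedforward network trained on bag-of-words counts. The toy tweets, vocabulary cap, hidden width, and three-class label scheme are illustrative assumptions.

```python
# Hedged sketch: feedforward classifier over bag-of-words features.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["bahut accha movie tha yaar", "worst service, bilkul bekaar"]
labels = torch.tensor([2, 0])  # assumed classes: 0=negative, 1=neutral, 2=positive

vectorizer = CountVectorizer(max_features=5000)  # vocabulary cap is an assumption
X = torch.tensor(vectorizer.fit_transform(tweets).toarray(), dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(X.shape[1], 128),  # hidden width is an assumption
    nn.ReLU(),
    nn.Linear(128, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(10):  # tiny training loop, purely for illustration
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(X), labels)
    loss.backward()
    optimizer.step()
print("training loss:", loss.item())
```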