Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
- URL: http://arxiv.org/abs/2310.09430v4
- Date: Sat, 30 Mar 2024 09:49:19 GMT
- Title: Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
- Authors: Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, Jiamou Liu
- Abstract summary: We develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus", and "LogiQAv2-plus".
Experiments show that these simple augmentations greatly hinder the models' performance.
Applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models.
- Score: 25.496627355906966
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning has not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of the LLM's reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve the model's performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
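The three perturbed subsets described in the abstract are straightforward to reproduce. Below is a minimal sketch, not the authors' released code (see the linked repository for that), assuming a simple dict-based example format with `context`, `question`, `options`, and an integer `label` field:

```python
import random

NONE_OPTION = "none of the other options is correct"

def shuffle_options(example: dict, seed: int = 0) -> dict:
    """Subset 1: randomly permute the answer options and re-index the label."""
    rng = random.Random(seed)
    order = list(range(len(example["options"])))
    rng.shuffle(order)
    return {
        **example,
        "options": [example["options"][i] for i in order],
        "label": order.index(example["label"]),
    }

def replace_answer_with_none(example: dict) -> dict:
    """Subset 2: replace the correct choice with 'none of the other options is correct'."""
    options = list(example["options"])
    options[example["label"]] = NONE_OPTION
    return {**example, "options": options}

def shuffle_and_replace(example: dict, seed: int = 0) -> dict:
    """Subset 3: apply both perturbations."""
    return shuffle_options(replace_answer_with_none(example), seed)

example = {
    "context": "All employees who attended the seminar received a certificate.",
    "question": "Which conclusion follows?",
    "options": [
        "Anyone without a certificate did not attend the seminar.",
        "Anyone with a certificate attended the seminar.",
        "No one who skipped the seminar received a certificate.",
        "Everyone received a certificate.",
    ],
    "label": 0,
}

for variant in (shuffle_options, replace_answer_with_none, shuffle_and_replace):
    print(variant(example))
```

A "-plus" style evaluation set is then simply the original dataset mapped through one of these three functions.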
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Transformer-based Language Models for Reasoning in the Description Logic ALCQ [2.8210912543324658]
We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$.
We investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models.
We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task.
arXiv Detail & Related papers (2024-10-12T18:25:34Z)
- Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
The Self-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets.
We train a reward model with the resulting rankings to evaluate the quality of rationales, and improve reasoning performance through reinforcement learning.
arXiv Detail & Related papers (2024-04-10T14:05:44Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace [21.015261553612643]
We present a dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7b to 33b parameters.
Our study reveals three primary findings: (i) Despite the models' overall performance being tied to data and parameter scale, individual abilities have different sensitivities to these factors.
(ii) Human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and can consistently enhance model performance as its volume increases.
arXiv Detail & Related papers (2023-10-30T15:37:10Z)
- Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks.
We propose a hierarchical generation scheme to encourage a more structured generation of chain-of-thought steps.
Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)
- Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning [27.224364543134094]
We introduce a novel logic-driven data augmentation approach, AMR-LDA.
AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph.
The graphs are then modified with logic-driven operations, and the modified AMR graphs are subsequently converted back into text to create augmented data.
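As a rough illustration of such a pipeline (a toy sketch, not the AMR-LDA implementation; a real system would use an actual AMR parser and AMR-to-text generator, and the conditional structure below is an assumed stand-in for the graph), consider a logic-driven transform such as contraposition applied to a conditional sentence:

```python
from dataclasses import dataclass

@dataclass
class ToyConditional:
    """Toy stand-in for an AMR graph of a conditional sentence 'If A, then B'."""
    antecedent: str
    consequent: str
    negate_antecedent: bool = False
    negate_consequent: bool = False

def parse(antecedent: str, consequent: str) -> ToyConditional:
    # In a real pipeline this step would be an AMR parser run over the raw sentence.
    return ToyConditional(antecedent, consequent)

def contraposition(g: ToyConditional) -> ToyConditional:
    # Logic-driven graph edit: 'If A then B' -> 'If not B then not A'.
    return ToyConditional(
        antecedent=g.consequent,
        consequent=g.antecedent,
        negate_antecedent=not g.negate_consequent,
        negate_consequent=not g.negate_antecedent,
    )

def generate(g: ToyConditional) -> str:
    # In a real pipeline this step would be an AMR-to-text generator.
    a = ("it is not the case that " if g.negate_antecedent else "") + g.antecedent
    b = ("it is not the case that " if g.negate_consequent else "") + g.consequent
    return f"If {a}, then {b}."

original = parse("the alarm rings", "everyone leaves the building")
print(generate(contraposition(original)))
# If it is not the case that everyone leaves the building, then it is not the case that the alarm rings.
```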
arXiv Detail & Related papers (2023-05-21T23:16:26Z)
- In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for Self-Training in Semi-Supervised Learning [0.0]
Self-training is a simple yet effective method within semi-supervised learning.
In this paper, we aim at rendering pseudo-label selection (PLS) more robust towards the involved modeling assumptions.
Results suggest that robustness with respect to model choice in particular can lead to substantial accuracy gains.
arXiv Detail & Related papers (2023-03-02T10:00:37Z)
- Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement in performance on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning [85.33459673197149]
We introduce a new reading comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations.
In this paper, we propose to identify biased data points and separate them into an EASY set, with the rest forming a HARD set.
Empirical results show that state-of-the-art models have an outstanding ability to capture the biases contained in the dataset, achieving high accuracy on the EASY set.
However, they struggle on the HARD set, with performance close to random guessing, indicating that more research is needed to essentially enhance the logical reasoning ability of current models.
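The summary above does not spell out how biased points are identified; one common probe, assumed here purely for illustration rather than taken from the ReClor paper, is to flag an example as EASY when an option-only baseline (a model that never sees the context or question) already predicts the correct answer:

```python
from typing import Callable, Iterable

def split_easy_hard(dataset: Iterable[dict], option_only_model: Callable[[list], int]):
    """Mark an example EASY if a context-blind baseline already answers it correctly."""
    easy, hard = [], []
    for ex in dataset:
        predicted = option_only_model(ex["options"])  # no context or question provided
        (easy if predicted == ex["label"] else hard).append(ex)
    return easy, hard

# Trivial stand-in baseline: always pick the longest option.
def longest_option(options: list) -> int:
    return max(range(len(options)), key=lambda i: len(options[i]))

dataset = [
    {"options": ["yes", "only if the premise is accepted as true", "maybe", "no"], "label": 1},
    {"options": ["A", "B", "C", "D"], "label": 2},
]
easy, hard = split_easy_hard(dataset, longest_option)
print(len(easy), len(hard))  # -> 1 1
```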
arXiv Detail & Related papers (2020-02-11T11:54:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.