Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options
- URL: http://arxiv.org/abs/2409.00113v2
- Date: Thu, 10 Oct 2024 20:46:36 GMT
- Title: Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options
- Authors: Gracjan Góral, Emilia Wiśnios, Piotr Sankowski, Paweł Budzianowski
- Abstract summary: This study explores whether large language models (LLMs) prioritize following instructions over reasoning and truth when given "misleading" instructions.
We introduce a new metric called "reflective judgment", which sheds new light on the relationship between the pre-training and post-training alignment schemes.
- Score: 2.1184929769291294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decision-making under full alignment requires balancing reasoning and faithfulness - a challenge for large language models (LLMs). This study explores whether LLMs prioritize following instructions over reasoning and truth when given "misleading" instructions, such as "Respond solely with A or B", even when neither option is correct. We introduce a new metric called "reflective judgment", which sheds new light on the relationship between the pre-training and post-training alignment schemes. In tasks ranging from basic arithmetic to domain-specific assessments, models such as GPT-4o, o1-mini, and Claude 3 Opus adhered to the instructions correctly but failed to reflect on the validity of the provided options. In contrast, models from the Llama 3.1 (8B, 70B, 405B) and base Qwen2.5 (7B, 14B, 32B) families exhibit improved refusal rates with size, indicating a scaling effect. We also observed that alignment techniques, though intended to enhance reasoning, sometimes weakened the models' ability to reject incorrect instructions, leading them to follow flawed prompts uncritically. Finally, we conducted a parallel human study that reveals similar patterns in human behavior and annotations. We highlight how popular RLHF datasets might disrupt either training or evaluation due to annotations exhibiting poor reflective judgment.
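To make the setup concrete, below is a minimal, hypothetical Python sketch of the kind of evaluation the abstract describes: each item carries two incorrect options, the model is instructed to respond solely with A or B, and a response is counted toward reflective judgment only if it declines the forced choice. The prompt wording, the query_model placeholder, and the simple regex-based compliance check are illustrative assumptions, not the authors' exact protocol or metric.

```python
# Minimal illustrative sketch (not the authors' code) of the "misleading
# instruction" setup: both options are wrong and the model is told to
# answer solely with A or B.
import re

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned reply here."""
    return "Neither option is correct; 2 + 2 = 4."

def build_prompt(question: str, option_a: str, option_b: str) -> str:
    return (
        f"{question}\n"
        f"A) {option_a}\n"
        f"B) {option_b}\n"
        "Respond solely with A or B."
    )

def is_reflective(response: str) -> bool:
    # Assumed, simplified criterion: a response counts as reflective
    # judgment only if it does NOT comply with the forced A/B choice.
    return re.fullmatch(r"\s*[AB][).]?\s*", response) is None

# Basic arithmetic items where neither option is correct (illustrative only).
items = [
    ("What is 2 + 2?", "5", "6"),
    ("What is 7 * 3?", "20", "22"),
]

reflective = sum(
    is_reflective(query_model(build_prompt(q, a, b))) for q, a, b in items
)
print(f"Reflective judgment rate: {reflective / len(items):.2f}")
```

Swapping query_model for a real API call and aggregating results per model family and size would yield the kind of refusal-rate-versus-scale comparison described above.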
Related papers
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks.
They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions.
We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)
- Learning to Correct for QA Reasoning with Black-box LLMs [37.13135300208977]
We propose CoBB (Correct for improving QA reasoning of Black-Box LLMs) to address the open challenge of improving QA reasoning with black-box LLMs.
It uses a trained adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or improved reasonings.
Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks.
arXiv Detail & Related papers (2024-06-26T18:57:32Z)
- LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right, but also whether it will be accepted by a listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
Advanced LLMs like GPT-4-Turbo, in contrast, place more emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
- Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse [73.65112477688353]
Recent studies have highlighted a phenomenon in large language models known as "the reversal curse".
We contend that the reversal curse is partially a result of specific model training objectives.
We propose a novel training method, BIdirectional Causal language modeling Optimization (BICO), designed to mitigate the reversal curse.
arXiv Detail & Related papers (2023-11-13T17:01:12Z)
- Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with a chain-of-thought (CoT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.