Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation
- URL: http://arxiv.org/abs/2401.06920v1
- Date: Fri, 12 Jan 2024 22:27:25 GMT
- Title: Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation
- Authors: Tyler Vergho, Jean-Francois Godbout, Reihaneh Rabbany, Kellin Pelrine
- Abstract summary: GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions.
We show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches.
We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) have been shown to be effective for
misinformation detection. However, the choice of LLMs for experiments varies
widely, leading to uncertain conclusions. In particular, GPT-4 is known to be
strong in this domain, but it is closed source, potentially expensive, and can
show instability between different versions. Meanwhile, alternative LLMs have
given mixed results. In this work, we show that Zephyr-7b presents a
consistently viable alternative, overcoming key limitations of commonly used
approaches like Llama-2 and GPT-3.5. This provides the research community with
a solid open-source option and shows open-source models are gradually catching
up on this task. We then highlight how GPT-3.5 exhibits unstable performance,
such that this very widely used model could provide misleading results in
misinformation detection. Finally, we validate new tools including approaches
to structured output and the latest version of GPT-4 (Turbo), showing they do
not compromise performance, thus unlocking them for future research and
potentially enabling more complex pipelines for misinformation mitigation.
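To make the setup concrete, below is a minimal sketch (not the authors' pipeline) of a zero-shot, structured-output misinformation check with Zephyr-7b via Hugging Face transformers; the checkpoint name, prompt wording, and label scheme are illustrative assumptions.

```python
# Illustrative sketch only: the checkpoint, prompt, and label scheme are
# assumptions, not the paper's actual misinformation-detection pipeline.
import json
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # assumed Zephyr-7b checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def check_claim(claim: str) -> dict:
    messages = [
        {"role": "system",
         "content": 'You are a fact-checker. Answer only with JSON: '
                    '{"label": "true|false|uncertain", "reason": "..."}'},
        {"role": "user", "content": claim},
    ]
    # Zephyr is a chat model, so render the conversation with its chat template.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = pipe(prompt, max_new_tokens=128, do_sample=False)
    completion = out[0]["generated_text"][len(prompt):].strip()
    # Open models do not enforce JSON output, so parse defensively.
    try:
        return json.loads(completion)
    except json.JSONDecodeError:
        return {"label": "uncertain", "reason": completion}

print(check_claim("The Great Wall of China is visible from the Moon."))
```

With GPT-4 Turbo, the same structured-output pattern can instead be enforced directly through the API's JSON mode rather than parsed defensively.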
Related papers
- Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report)
Open-source large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and biases of LLMs in two canonical game-theoretic, two-player, non-zero-sum games: the Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one systematic bias.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations.
Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Large Language Model for Vulnerability Detection: Emerging Results and Future Directions
Previous learning-based vulnerability detection methods relied on either medium-sized pre-trained models or smaller neural networks trained from scratch.
Recent advances in large pre-trained language models (LLMs) have showcased remarkable few-shot learning capabilities across various tasks.
arXiv Detail & Related papers (2024-01-27T17:39:36Z)
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models
We present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting.
Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT-3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT-4.
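For illustration, here is a minimal sketch of the listwise (permutation-generation) reranking pattern this line of work follows; RankVicuna's exact prompt and output parsing may differ.

```python
# Sketch of listwise reranking: show the model numbered passages, ask for a
# permutation like "[2] > [1] > [3]", then parse it. Prompt wording is assumed.
import re

def build_listwise_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Rank the following {len(passages)} passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        'Answer only with the ranking, e.g. "[2] > [1] > [3]".'
    )

def parse_ranking(response: str, num_passages: int) -> list[int]:
    """Extract a permutation of 0-based passage indices from the model's answer."""
    order = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    # Append any passages the model omitted, keeping their original order.
    order += [i for i in range(num_passages) if i not in order]
    return order

# Example with a hypothetical model response:
print(parse_ranking("[3] > [1] > [2]", num_passages=3))  # [2, 0, 1]
```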
arXiv Detail & Related papers (2023-09-26T17:31:57Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4
We show that GPT-4 can outperform prior methods in multiple settings and languages.
We propose techniques to handle uncertainty that can detect impossible examples and strongly improve outcomes.
This research lays the groundwork for future tools that can drive real-world progress to combat misinformation.
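As a generic illustration of uncertainty-aware abstention (not necessarily the paper's own technique), predictions can be gated on a confidence score so that ambiguous, possibly impossible examples are abstained on rather than guessed:

```python
# Generic uncertainty gating: abstain in the ambiguous middle band.
# Thresholds and the 0-1 confidence scale are illustrative assumptions.
def gate_prediction(label: str, confidence: float,
                    low: float = 0.35, high: float = 0.65) -> str:
    """Return the label, or 'abstain' when confidence falls in [low, high]."""
    if low <= confidence <= high:
        return "abstain"  # likely an unresolvable or 'impossible' example
    return label

print(gate_prediction("false", 0.90))  # false
print(gate_prediction("false", 0.50))  # abstain
```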
arXiv Detail & Related papers (2023-05-24T09:10:20Z)
- Prompting GPT-3 To Be Reliable
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)