ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow
Discussions
- URL: http://arxiv.org/abs/2402.08801v1
- Date: Tue, 13 Feb 2024 21:15:33 GMT
- Title: ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow
Discussions
- Authors: Leuson Da Silva and Jordan Samhi and Foutse Khomh
- Abstract summary: ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development.
Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on.
- Score: 13.7001994656622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the
premier platform for developers' queries on programming and software
development. Demonstrating an ability to generate instant, human-like responses
to technical questions, ChatGPT has ignited debates within the developer
community about the evolving role of human-driven platforms in the age of
generative AI. Two months after ChatGPT's release, Meta released its answer
with its own Large Language Model (LLM) called LLaMA: the race was on. We
conducted an empirical study analyzing questions from Stack Overflow and using
these LLMs to address them. This way, we aim to (i) measure the evolution of
user engagement with Stack Overflow over time; (ii) quantify the reliability of
LLMs' answers and their potential to replace Stack Overflow in the long term;
(iii) identify and understand why LLMs fail; and (iv) compare the LLMs with
each other. Our empirical results are unequivocal: ChatGPT and LLaMA challenge
human expertise, yet do not outperform it in some domains, while a significant
decline in user posting activity has been observed. Furthermore, we discuss the
impact of our findings on the usage and development of new LLMs.
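The engagement measurement in aim (i) amounts to tracking posting volume over time. A minimal sketch of that idea, assuming a local export of Stack Overflow question creation timestamps (the field format below is illustrative, not the study's actual pipeline):

```python
from collections import Counter
from datetime import datetime

def monthly_question_counts(timestamps):
    """Count questions per calendar month from ISO-8601 creation timestamps.

    `timestamps` is assumed to come from a local Stack Overflow data dump;
    a drop in successive monthly counts would indicate declining engagement.
    """
    months = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        months[(dt.year, dt.month)] += 1
    # Return months in chronological order for plotting or trend tests.
    return dict(sorted(months.items()))

counts = monthly_question_counts([
    "2022-11-05T10:00:00", "2022-11-20T08:30:00", "2022-12-01T12:00:00",
])
print(counts)  # {(2022, 11): 2, (2022, 12): 1}
```

In practice the same counts could be obtained from the official Stack Exchange data dump or API; the sketch only shows the aggregation step.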
Related papers
- An exploratory analysis of Community-based Question-Answering Platforms and GPT-3-driven Generative AI: Is it the end of online community-based learning? [0.6749750044497732]
ChatGPT offers software engineers an interactive alternative to community question-answering platforms like Stack Overflow.
We analyze 2564 Python and JavaScript questions from Stack Overflow that were asked between January 2022 and December 2022.
Our analysis indicates that ChatGPT's responses are 66% shorter and share 35% more words with the questions, showing a 25% increase in positive sentiment compared to human responses.
arXiv Detail & Related papers (2024-09-26T02:17:30Z) - An Empirical Study on Challenges for LLM Developers [28.69628251749012]
We crawl and analyze 29,057 relevant questions from a popular OpenAI developer forum.
After manually analyzing 2,364 sampled questions, we construct a taxonomy of challenges faced by LLM developers.
arXiv Detail & Related papers (2024-08-06T05:46:28Z) - StackRAG Agent: Improving Developer Answers with Retrieval-Augmented Generation [2.225268436173329]
StackRAG is a retrieval-augmented, multiagent generation tool based on Large Language Models.
It combines the two worlds by aggregating knowledge from Stack Overflow (SO) to enhance the reliability of the generated answers.
Initial evaluations show that the generated answers are correct, accurate, relevant, and useful.
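The retrieval-augmented pattern this abstract describes can be illustrated with a minimal, self-contained sketch. The keyword scorer and canned knowledge base below are hypothetical stand-ins for StackRAG's actual Stack Overflow retrieval and LLM generation:

```python
def retrieve(query, knowledge_base, k=2):
    """Rank knowledge-base snippets by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, snippets):
    """Assemble an augmented prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Context from Stack Overflow:\n{context}\n\nQuestion: {query}"

# Toy knowledge base standing in for aggregated Stack Overflow answers.
kb = [
    "Use list comprehensions to filter a Python list concisely.",
    "Java streams support filter and map operations.",
    "CSS flexbox aligns items along a main axis.",
]
query = "How do I filter a list in Python?"
prompt = build_prompt(query, retrieve(query, kb))
print(prompt)
```

A real system would replace the keyword overlap with dense embeddings and send the assembled prompt to an LLM; the sketch only shows the retrieve-then-augment structure.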
arXiv Detail & Related papers (2024-06-19T21:07:35Z) - When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on consist mainly of tricky, humorous, and misleading texts collected from the real-world internet.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability of general-purpose language understanding and generation is acquired by training models with billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - ChatGPT's One-year Anniversary: Are Open-Source Large Language Models
Catching up? [71.12709925152784]
ChatGPT has brought a seismic shift in the entire landscape of AI.
It showed that a model could answer human questions and follow instructions across a broad range of tasks.
While closed-source LLMs generally outperform their open-source counterparts, the progress on the latter has been rapid.
This has crucial implications not only on research but also on business.
arXiv Detail & Related papers (2023-11-28T17:44:51Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
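A common baseline among the UE methods such frameworks implement is sequence confidence from per-token log-probabilities. The generic sketch below illustrates that idea only; it is not LM-Polygraph's actual API, and the example log-probs are invented:

```python
import math

def mean_logprob_confidence(token_logprobs):
    """Baseline uncertainty signal: average per-token log-probability of a
    generated answer, mapped into (0, 1] via exp. Higher means the model
    assigned its own output more probability, i.e. was more confident."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for two generated answers (e.g. returned
# alongside tokens by an LLM API).
confident = [-0.1, -0.2, -0.1]
hesitant = [-1.5, -2.0, -1.8]
print(mean_logprob_confidence(confident) > mean_logprob_confidence(hesitant))  # True
```

Such a score is what a chat interface could surface next to each answer, as the demo application described above does with its confidence scores.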
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - Investigating Answerability of LLMs for Long-Form Question Answering [35.41413072729483]
We focus on long-form question answering (LFQA) because it has several practical and impactful applications.
We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting.
arXiv Detail & Related papers (2023-09-15T07:22:56Z) - From Mundane to Meaningful: AI's Influence on Work Dynamics -- evidence
from ChatGPT and Stack Overflow [0.0]
We explore how ChatGPT changed a fundamental aspect of coding: problem-solving.
We exploit ChatGPT's sudden release on November 30, 2022, and its effect on usage of the largest online community for coders: Stack Overflow.
arXiv Detail & Related papers (2023-08-22T09:30:02Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.