ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?
- URL: http://arxiv.org/abs/2411.07360v1
- Date: Mon, 11 Nov 2024 20:54:54 GMT
- Title: ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?
- Authors: Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang,
- Abstract summary: It is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms.
ChiME uses context-free grammar to parse stack traces in technical reports.
ChiME shows 30.3% more correction over ChatGPT responses.
- Score: 6.079560395398429
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations, the tendency to produce irrelevant/incorrect responses, are prevalent concerns in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT are studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, due to two reasons 1) limitations to understand complex technical contents in code snippets like stack traces, and 2) limitations to integrate contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine) whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME.
Related papers
- What Developers Ask to ChatGPT in GitHub Pull Requests? an Exploratory Study [0.0]
Large Language Models (LLMs) such as ChatGPT have introduced a new set of tools to support software developers in solving pro- gramming tasks.<n>To explore this limitation, we conducted a manual evaluation of 155 valid ChatGPT links extracted from 139 merged Pull Requests.<n>Our results produced a catalog of 14 types of ChatGPT requests categorized into four main groups.
arXiv Detail & Related papers (2025-08-23T23:24:47Z) - Why Do Developers Engage with ChatGPT in Issue-Tracker? Investigating Usage and Reliance on ChatGPT-Generated Code [4.605779671279481]
We analyzed 1,152 Developer-ChatGPT conversations across 1,012 issues in GitHub.
ChatGPT is primarily utilized for ideation, whereas its usage for validation is minimal.
ChatGPT-generated code was used as-is to resolve only 5.83% of the issues.
arXiv Detail & Related papers (2024-12-09T18:47:31Z) - Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks? [10.389763758883975]
Recent studies proposed utilizing ChatGPT as both a developer and tester for collaborative software development.
We conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair.
arXiv Detail & Related papers (2024-05-21T09:47:33Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - DEMASQ: Unmasking the ChatGPT Wordsmith [63.8746084667206]
We propose an effective ChatGPT detector named DEMASQ, which accurately identifies ChatGPT-generated content.
Our method addresses two critical factors: (i) the distinct biases in text composition observed in human- and machine-generated content and (ii) the alterations made by humans to evade previous detection methods.
arXiv Detail & Related papers (2023-11-08T21:13:05Z) - A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair [19.123640635549524]
Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of software engineering tasks.
This paper reviews the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives.
ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% prediction accuracy.
arXiv Detail & Related papers (2023-10-13T06:11:47Z) - ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired an infrastructural status.
Existing benchmarks encounter two challenges: (1) Disregard for periodical evaluation and (2) Lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March, 2023 to now.
arXiv Detail & Related papers (2023-04-27T11:33:48Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Towards Making the Most of ChatGPT for Machine Translation [75.576405098545]
ChatGPT shows remarkable capabilities for machine translation (MT)
Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages.
arXiv Detail & Related papers (2023-03-24T03:35:21Z) - ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction
Benchmark [11.36853733574956]
ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI.
We compare it with commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR)
We find that ChatGPT performs not as well as those baselines in terms of the automatic evaluation metrics.
arXiv Detail & Related papers (2023-03-15T00:35:50Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.