Related papers: Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

URL: http://arxiv.org/abs/2308.10410v4
Date: Thu, 23 May 2024 12:42:06 GMT
Title: Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts
Authors: Fan Gao, Hang Jiang, Rui Yang, Qingcheng Zeng, Jinghui Lu, Moritz Blum, Dairui Liu, Tianwei She, Yuang Jiang, Irene Li,
Abstract summary: Large Language Models (LLMs) have achieved significant success across various general tasks. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science. We compare both human and GPT-based evaluation scores and provide in-depth analysis.
Score: 21.150221839202878
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert inputs and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, inluding GPT-3.5, PaLM2, and LLaMa2 by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. At last, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.

Related papers

An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented. We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages [0.0]
This study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction using a 'human-out-of-the-loop' approach. GPT-4 had accuracy on par with human performance in most tasks, but results were skewed by chance agreement and dataset imbalance. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect'
arXiv Detail & Related papers (2023-10-26T16:18:30Z)
Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain. With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback. We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5. We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class' assessments to confidently passing the courses with no human involvement. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z)
Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples. We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation model in the context of human-centric standardized exams. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003. GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z)
GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.