Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco
vs Bard vs ChatGPT -- A Text-to-SQL Parsing Comparison
- URL: http://arxiv.org/abs/2310.10190v1
- Date: Mon, 16 Oct 2023 08:52:41 GMT
- Authors: Shuo Sun, Yuchen Zhang, Jiahuan Yan, Yuze Gao, Donovan Ong, Bin Chen,
Jian Su
- Abstract summary: In recent times, a number of models have emerged, claiming performance near that of GPT-3.5 or GPT-4.
We pit six popular large language models against each other, systematically evaluating their Text-to-SQL parsing capability on nine benchmark datasets.
Regrettably, the open-sourced models fell significantly short of the performance achieved by closed-source models like GPT-3.5, highlighting the need for further work to bridge the performance gap between these models.
- Score: 18.092211166785397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of ChatGPT has ignited an AI race, with researchers striving to
develop new large language models (LLMs) that can match or surpass the language
understanding and generation abilities of commercial ones. In recent times, a
number of models have emerged, claiming performance near that of GPT-3.5 or
GPT-4 through various instruction-tuning methods. As practitioners of
Text-to-SQL parsing, we are grateful for their valuable contributions to
open-source research. However, it is important to approach these claims with a
sense of scrutiny and ascertain the actual effectiveness of these models.
Therefore, we pit six popular large language models against each other,
systematically evaluating their Text-to-SQL parsing capability on nine
benchmark datasets with five different prompting strategies, covering both
zero-shot and few-shot scenarios. Regrettably, the open-sourced models fell
significantly short of the performance achieved by closed-source models like
GPT-3.5, highlighting the need for further work to bridge the performance gap
between these models.
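The abstract contrasts zero-shot and few-shot prompting for Text-to-SQL parsing. As a rough illustration of the difference between the two regimes, the sketch below builds both kinds of prompt; the schema, question, and exemplar are invented for illustration and are not taken from the paper's nine benchmark datasets or its five specific prompting strategies.

```python
# Illustrative sketch of zero-shot vs few-shot Text-to-SQL prompting.
# All schema/question strings here are made up, not from the paper.

SCHEMA = "CREATE TABLE singer (singer_id INT, name TEXT, age INT);"

def zero_shot_prompt(schema: str, question: str) -> str:
    """Zero-shot: the model sees only the schema and the target question."""
    return (
        f"### SQLite schema:\n{schema}\n"
        f"### Question: {question}\n"
        "### SQL:"
    )

def few_shot_prompt(schema: str, question: str,
                    exemplars: list[tuple[str, str]]) -> str:
    """Few-shot: solved (question, SQL) pairs are prepended as demonstrations."""
    demos = "\n".join(f"### Question: {q}\n### SQL: {sql}"
                      for q, sql in exemplars)
    return (
        f"### SQLite schema:\n{schema}\n"
        f"{demos}\n"
        f"### Question: {question}\n"
        "### SQL:"
    )

prompt = few_shot_prompt(
    SCHEMA,
    "How many singers are older than 30?",
    [("List all singer names.", "SELECT name FROM singer;")],
)
```

The completion the evaluated model produces after the final `### SQL:` marker is then compared against the gold query; the exact prompt templates and exemplar-selection methods vary across the paper's five strategies.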
Related papers
- MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation [10.205010004198757]
Text-to-SQL generation enables non-experts to interact with databases via natural language.
Recent advances on large closed-source models like GPT-4 present challenges in accessibility, privacy, and latency.
We focus on developing small, efficient, and open-source text-to-SQL generation models.
arXiv Detail & Related papers (2024-10-16T18:03:24Z) - What Is Missing in Multilingual Visual Reasoning and How to Fix It [64.47951359580556]
We evaluate NLP models' multilingual, multimodal capabilities by testing on a visual reasoning task.
Proprietary systems like GPT-4V currently obtain the best performance on this task, but open models lag in comparison.
Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open model LLaVA by 13.4%.
arXiv Detail & Related papers (2024-03-03T05:45:27Z) - Large Language Models as Zero-shot Dialogue State Tracker through Function Calling [42.00097476584174]
We propose a novel approach for solving dialogue state tracking with large language models (LLMs) through function calling.
This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning.
We show that our approach achieves exceptional performance with both modestly sized open-source LLMs and proprietary ones.
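The function-calling idea in this entry can be sketched as follows: the slot schema is exposed to the model as a function signature, and the arguments of the model's emitted "function call" become the tracked dialogue state. The function name, slot names, and output format below are hypothetical, not the paper's actual schema.

```python
# Hypothetical sketch of dialogue state tracking (DST) via function calling.
# The function/slot names and the JSON call format are illustrative only.
import json

TRACK_STATE_FN = {
    "name": "update_dialogue_state",
    "parameters": {
        "type": "object",
        "properties": {
            "hotel_area": {"type": "string"},
            "hotel_stars": {"type": "integer"},
        },
    },
}

def parse_function_call(model_output: str) -> dict:
    """Extract the tracked state from the model's JSON function call."""
    call = json.loads(model_output)
    if call["name"] != TRACK_STATE_FN["name"]:
        raise ValueError(f"unexpected function: {call['name']}")
    return call["arguments"]

# For a user turn like "I need a 4-star hotel in the north",
# the model might emit a call such as:
raw = ('{"name": "update_dialogue_state", '
       '"arguments": {"hotel_area": "north", "hotel_stars": 4}}')
state = parse_function_call(raw)
```

Because the state is produced as structured arguments rather than free text, it needs no task-specific parsing, which is what allows zero-shot transfer to new domains by swapping in a new function schema.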
arXiv Detail & Related papers (2024-02-16T06:13:18Z) - MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large
Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z) - A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5)
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z) - A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability [57.71052396828714]
This paper presents the first comprehensive analysis of ChatGPT's Text-to-SQL abilities.
We conducted experiments on 12 benchmark datasets with different languages, settings, or scenarios.
Although a gap remains to current state-of-the-art (SOTA) models, ChatGPT's performance is still impressive.
arXiv Detail & Related papers (2023-03-12T04:22:01Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering that requires commonsense reasoning, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z) - Internet-augmented language models through few-shot prompting for
open-domain question answering [6.573232954655063]
We capitalize on the unique few-shot capabilities offered by large-scale language models to overcome some of their challenges.
We use few-shot prompting to learn to condition language models on information returned from the web using Google Search.
We find that language models conditioned on the web surpass performance of closed-book models of similar, or even larger, model sizes in open-domain question answering.
arXiv Detail & Related papers (2022-03-10T02:24:14Z) - Neural Models for Offensive Language Detection [0.0]
Offensive language detection is an ever-growing natural language processing (NLP) application.
We believe that improving and comparing machine learning models to combat such harmful content is an important and challenging goal for this thesis.
arXiv Detail & Related papers (2021-05-30T13:02:45Z) - TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.