Sensitivity and Robustness of Large Language Models to Prompt Template in Japanese Text Classification Tasks
- URL: http://arxiv.org/abs/2305.08714v2
- Date: Thu, 8 Jun 2023 02:14:20 GMT
- Title: Sensitivity and Robustness of Large Language Models to Prompt Template in Japanese Text Classification Tasks
- Authors: Chengguang Gan and Tatsunori Mori
- Abstract summary: A critical issue has been identified within this domain: the inadequate sensitivity and robustness of large language models to prompt templates.
This paper explores the issue through a comprehensive evaluation of several representative Large Language Models (LLMs) and a widely utilized pre-trained language model (PLM).
Our experimental results reveal startling discrepancies: a simple modification in the sentence structure of the prompt template led to a drastic drop in the accuracy of GPT-4 from 49.21 to 25.44.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on prompt engineering has seen a notable surge in recent
years, primarily driven by advances in pre-trained language models and
large language models. However, a critical issue has been identified within
this domain: the inadequate sensitivity and robustness of these models
to prompt templates, particularly in lesser-studied languages such as
Japanese. This paper explores the issue through a comprehensive evaluation of
several representative Large Language Models (LLMs) and a widely utilized
pre-trained language model (PLM). These models are scrutinized using a Japanese
benchmark dataset, with the aim of assessing and analyzing the performance of
current multilingual models in this context. Our experimental results reveal
startling discrepancies: a simple modification in the sentence structure of the
prompt template led to a drastic drop in the accuracy of GPT-4 from 49.21 to 25.44.
This observation underscores the fact that even the high-performing GPT-4
model encounters significant stability issues when dealing with diverse
Japanese prompt templates, rendering the consistency of its output
questionable. In light of these findings, we conclude by proposing
potential research trajectories to further enhance the development and
performance of Large Language Models at their current stage.
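
The instability reported above can be probed with a small harness that runs the same classification task under several prompt templates and compares per-template accuracy. Below is a minimal sketch of such a setup; the Japanese sentiment templates, the two-label task, and the query_model stub are illustrative assumptions, not the authors' actual benchmark or prompts.

```python
# Minimal sketch of a prompt-template sensitivity evaluation.
# The templates, labels, and query_model stub are illustrative assumptions,
# not the paper's actual benchmark, prompts, or code.

TEMPLATES = [
    # Same task, two sentence structures; the paper reports that such
    # surface-level changes alone can roughly halve GPT-4's accuracy.
    "次の文章を「ポジティブ」か「ネガティブ」に分類してください。\n文章: {text}\n分類:",
    "文章: {text}\nこの文章は「ポジティブ」と「ネガティブ」のどちらですか。答え:",
]


def query_model(prompt: str) -> str:
    """Stub for a call to an LLM (e.g. an API client); replace with a real one."""
    raise NotImplementedError


def template_accuracy(template: str, dataset: list[tuple[str, str]]) -> float:
    """Fraction of (text, gold_label) pairs the model labels correctly."""
    correct = 0
    for text, gold in dataset:
        prediction = query_model(template.format(text=text)).strip()
        correct += prediction == gold
    return correct / len(dataset)


# Example: a large spread across templates signals low prompt robustness.
# scores = {t: template_accuracy(t, dev_set) for t in TEMPLATES}
```

Reporting the spread between the best and worst template alongside the mean makes the kind of template sensitivity observed for GPT-4 directly visible.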
Related papers
- Large corpora and large language models: a replicable method for automating grammatical annotation
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z)
- Evaluating Large Language Models on Controlled Generation Tasks
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art fine-tuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- PaLM: Scaling Language Modeling with Pathways
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- A Systematic Investigation of Commonsense Understanding in Large Language Models
Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting.
We ask whether these models exhibit commonsense understanding by evaluating models against four commonsense benchmarks.
arXiv Detail & Related papers (2021-10-31T22:20:36Z)
- Language Models are Few-shot Multilingual Learners
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Comparative Study of Language Models on Cross-Domain Data with Model Agnostic Explainability
The study compares the state-of-the-art language models BERT and ELECTRA, along with the BERT derivatives RoBERTa, ALBERT, and DistilBERT.
The experimental results establish a new state-of-the-art for the 2013 rating classification task and the Financial Phrasebank sentiment detection task, with 69% and 88.2% accuracy respectively.
arXiv Detail & Related papers (2020-09-09T04:31:44Z)