ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
- URL: http://arxiv.org/abs/2501.00062v1
- Date: Sun, 29 Dec 2024 05:29:52 GMT
- Title: ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
- Authors: James P. Beno
- Abstract summary: This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification.
We fine-tuned four models using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent.
Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance.
- Abstract: Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLM) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.74 macro F1 vs. 79.29 ELECTRA Base FT, 79.52 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.77) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
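The core idea of the abstract (sharing a fine-tuned encoder's prediction with a GPT model via the prompt) can be sketched as below. This is a minimal, hypothetical illustration: the ELECTRA outputs are hard-coded, the prompt wording is assumed rather than taken from the paper, and in practice the string would be sent to the GPT-4o-mini API.

```python
# Sketch: augment a sentiment-classification prompt with a fine-tuned
# encoder's predicted label and class probabilities, in the spirit of the
# ELECTRA -> GPT setup described in the abstract. All values are illustrative.

LABELS = ["negative", "neutral", "positive"]

def build_augmented_prompt(review: str, encoder_label: str,
                           encoder_probs: dict) -> str:
    """Compose a three-way sentiment prompt that includes the encoder's view."""
    prob_text = ", ".join(
        f"{label}: {encoder_probs[label]:.2f}" for label in LABELS
    )
    return (
        "Classify the sentiment of the review as negative, neutral, "
        "or positive.\n"
        f"Review: {review}\n"
        f"A fine-tuned ELECTRA classifier predicted: {encoder_label} "
        f"(probabilities: {prob_text}).\n"
        "Answer with one word."
    )

# In a real pipeline, the label and probabilities would come from a
# fine-tuned ELECTRA classifier's softmax output for this review.
prompt = build_augmented_prompt(
    "The plot dragged, but the acting was superb.",
    encoder_label="positive",
    encoder_probs={"negative": 0.21, "neutral": 0.14, "positive": 0.65},
)
print(prompt)
```

The paper also reports passing retrieved examples, which would amount to appending a few labeled reviews to the same prompt before the question.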
Related papers
- Utilizing Large Language Models for Named Entity Recognition in Traditional Chinese Medicine against COVID-19 Literature: Comparative Study [4.680391123850371]
We established a dataset of 389 articles on TCM against COVID-19, and manually annotated 48 of them with 6 types of entities belonging to 3 domains as the ground truth.
We then performed NER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4 state-of-the-art BERT-based question-answering (QA) models.
arXiv Detail & Related papers (2024-08-24T06:59:55Z)
- Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models [27.675558033502565]
We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection.
For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of.
For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
arXiv Detail & Related papers (2024-07-12T03:33:13Z)
- GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning [48.71952325015267]
We apply PEFT methods to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes.
We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process.
This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models.
arXiv Detail & Related papers (2024-07-05T14:16:47Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in two canonical game-theoretic two-player non-zero-sum games: Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one systematic bias.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4 [10.01547158445743]
We evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT).
We found that the two PEFT adapters improve the F1 score (+0.0346) and consistency (+0.152) of the LLMs.
Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328.
arXiv Detail & Related papers (2024-03-30T22:27:21Z)
- ORPO: Monolithic Preference Optimization without Reference Model [9.53888551630878]
We study the crucial role of supervised fine-tuning within the context of preference alignment.
We introduce a reference-model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the need for an additional preference alignment phase.
Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters.
arXiv Detail & Related papers (2024-03-12T14:34:08Z)
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [50.00235162432848]
We train ALMA models with only 22K parallel sentences and 12M parameters.
The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4.
arXiv Detail & Related papers (2024-01-16T15:04:51Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.