In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
- URL: http://arxiv.org/abs/2410.11265v1
- Date: Tue, 15 Oct 2024 04:42:21 GMT
- Title: In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
- Authors: Alireza Shamshiri, Kyeong Rok Ryu, June Young Park,
- Abstract summary: This study evaluates the performance of three leading large language models: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.
Our results indicate that GPT-4o excels in zero-shot scenarios for simpler, shorter documents, while Claude 3.5 Sonnet surpasses GPT-4o in handling more complex, sentiment-fluctuating opinions.
- Score: 2.974480694911691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive results across various tasks. However, they still struggle with long-context documents. This study evaluates the performance of three leading LLMs: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on lengthy, complex, and opinion-varying documents concerning infrastructure projects, under both zero-shot and few-shot scenarios. Our results indicate that GPT-4o excels in zero-shot scenarios for simpler, shorter documents, while Claude 3.5 Sonnet surpasses GPT-4o in handling more complex, sentiment-fluctuating opinions. In few-shot scenarios, Claude 3.5 Sonnet outperforms overall, while GPT-4o shows greater stability as the number of demonstrations increases.
Related papers
- Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification [1.2234742322758418]
This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews.<n>We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types.<n>CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models.
arXiv Detail & Related papers (2025-10-17T16:53:09Z) - Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works.<n>We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems.<n>Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z) - Reasoning about Affordances: Causal and Compositional Reasoning in LLMs [0.0]
We investigate the causal and compositional reasoning abilities of Large Language Models (LLMs) and humans in the domain of object affordances.
In Experiment 1, we evaluated GPT-3.5 and GPT-4o, finding that GPT-4o performed on par with human participants, while GPT-3.5 lagged significantly.
In Experiment 2, we introduced two new conditions, Distractor and Image, and evaluated Claude 3 Sonnet and Claude 3.5 Sonnet in addition to the GPT models.
The Distractor condition significantly impaired performance across humans and models, although GPT-4o and Claude 3.5 still performed well above
arXiv Detail & Related papers (2025-02-23T15:21:47Z) - Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models [14.906150451947443]
textbfCounting-Stars is a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs.<n>We conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1.<n> Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks.
arXiv Detail & Related papers (2024-03-18T14:01:45Z) - Can Large Language Models do Analytical Reasoning? [45.69642663863077]
This paper explores the cutting-edge Large Language Model with analytical reasoning on sports.
We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
arXiv Detail & Related papers (2024-03-06T20:22:08Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision)
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - A Comparative Analysis of Large Language Models for Code Documentation Generation [1.9282110216621835]
The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama2, and Starchat on various parameters like Accuracy, Completeness, Relevance, Understandability, Readability and Time Taken.
arXiv Detail & Related papers (2023-12-16T06:40:09Z) - Large language models for aspect-based sentiment analysis [0.0]
We assess the performance of GPT-4 and GPT-3.5 in zero shot, few shot and fine-tuned settings.
Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task.
arXiv Detail & Related papers (2023-10-27T10:03:21Z) - Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
fully automated prompt engineering with no human in the loop requires more study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z) - GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot
Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z) - Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important on large language models, such as GPT3 where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.