Related papers: In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions

In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions

URL: http://arxiv.org/abs/2410.11265v1
Date: Tue, 15 Oct 2024 04:42:21 GMT
Title: In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
Authors: Alireza Shamshiri, Kyeong Rok Ryu, June Young Park,
Abstract summary: This study evaluates the performance of three leading large language models: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Our results indicate that GPT-4o excels in zero-shot scenarios for simpler, shorter documents, while Claude 3.5 Sonnet surpasses GPT-4o in handling more complex, sentiment-fluctuating opinions.
Score: 2.974480694911691
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved impressive results across various tasks. However, they still struggle with long-context documents. This study evaluates the performance of three leading LLMs: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on lengthy, complex, and opinion-varying documents concerning infrastructure projects, under both zero-shot and few-shot scenarios. Our results indicate that GPT-4o excels in zero-shot scenarios for simpler, shorter documents, while Claude 3.5 Sonnet surpasses GPT-4o in handling more complex, sentiment-fluctuating opinions. In few-shot scenarios, Claude 3.5 Sonnet outperforms overall, while GPT-4o shows greater stability as the number of demonstrations increases.

Related papers

Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification [1.2234742322758418]
This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews.<n>We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types.<n>CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models.
arXiv Detail & Related papers (2025-10-17T16:53:09Z)
Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works.<n>We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems.<n>Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z)
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs [0.0]
We investigate the causal and compositional reasoning abilities of Large Language Models (LLMs) and humans in the domain of object affordances. In Experiment 1, we evaluated GPT-3.5 and GPT-4o, finding that GPT-4o performed on par with human participants, while GPT-3.5 lagged significantly. In Experiment 2, we introduced two new conditions, Distractor and Image, and evaluated Claude 3 Sonnet and Claude 3.5 Sonnet in addition to the GPT models. The Distractor condition significantly impaired performance across humans and models, although GPT-4o and Claude 3.5 still performed well above
arXiv Detail & Related papers (2025-02-23T15:21:47Z)
Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models [14.906150451947443]
textbfCounting-Stars is a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs.<n>We conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1.<n> Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks.
arXiv Detail & Related papers (2024-03-18T14:01:45Z)
Can Large Language Models do Analytical Reasoning? [45.69642663863077]
This paper explores the cutting-edge Large Language Model with analytical reasoning on sports. We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
arXiv Detail & Related papers (2024-03-06T20:22:08Z)
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision) The core of our analysis delves into the distinct visual comprehension abilities of each model. Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
A Comparative Analysis of Large Language Models for Code Documentation Generation [1.9282110216621835]
The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama2, and Starchat on various parameters like Accuracy, Completeness, Relevance, Understandability, Readability and Time Taken.
arXiv Detail & Related papers (2023-12-16T06:40:09Z)
Large language models for aspect-based sentiment analysis [0.0]
We assess the performance of GPT-4 and GPT-3.5 in zero shot, few shot and fine-tuned settings. Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task.
arXiv Detail & Related papers (2023-10-27T10:03:21Z)
Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies. fully automated prompt engineering with no human in the loop requires more study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z)
GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks. In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z)
Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models. Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity. The performance gains are particularly important on large language models, such as GPT3 where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.