Related papers: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

URL: http://arxiv.org/abs/2507.08621v2
Date: Thu, 24 Jul 2025 11:49:06 GMT
Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Authors: Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski,
Abstract summary: Large language models (LLMs) have enhanced the efficiency of analyzing and extracting argument semantics.<n>This paper presents a study of a selection of LLM's, using diverse datasets such as Args.me and UKP.<n>The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM's, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

Related papers

Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations.<n>How to check their outputs effectively and efficiently has become a critical problem in their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models (LLMs)<n>JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.<n>Our experimental results reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par or better than the human average but significantly worse than the human ceiling.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.<n>This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.<n>We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets. We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Efficient argument classification with compact language models and ChatGPT-4 refinements [0.0]
This paper presents comparative studies between a few deep learning-based models in argument mining. The main novelty of this paper is the ensemble model which is based on BERT architecture and ChatGPT-4 as fine tuning model. The presented results show that BERT+ChatGPT-4 outperforms the rest of the models including other Transformer-based and LSTM-based models.
arXiv Detail & Related papers (2024-03-20T16:24:10Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language. This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
"I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models [0.0]
We evaluate two large language models (LLMs) ability to perform argumentative reasoning. We find that scoring-wise the LLMs match or surpass the SOTA in AM and APE. However, statistical analysis on the LLMs outputs when subject to small, yet still human-readable, alterations in the I/O representations showed that the models are not performing reasoning.
arXiv Detail & Related papers (2023-09-29T02:41:38Z)
Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [29.433401785920065]
Large language models (LLMs) can generate causal arguments with high probability. LLMs may be used by human domain experts to save effort in setting up a causal analysis.
arXiv Detail & Related papers (2023-04-28T19:00:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.