Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation
- URL: http://arxiv.org/abs/2406.03151v3
- Date: Tue, 20 Aug 2024 15:41:27 GMT
- Title: Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation
- Authors: Hao Li, Yuping Wu, Viktor Schlegel, Riza Batista-Navarro, Tharindu Madusanka, Iqra Zahid, Jiayan Zeng, Xiaochi Wang, Xinran He, Yizhi Li, Goran Nenadic
- Abstract summary: We introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate.
Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks.
We find that, while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR), and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find that, while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures and in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at https://github.com/HaoBytes/ArgSum-Datatset
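To make the task chain concrete, below is a minimal, hypothetical sketch of the four-stage pipeline (ED → ECR → ASR → SQE). Every function is an illustrative stub standing in for a model call; none of the names or heuristics come from the authors' code or API.

```python
# Hypothetical sketch of the end-to-end pipeline the paper benchmarks:
# ED -> ECR -> ASR -> SQE. All stubs are placeholders for model calls.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    evidence: List[str] = field(default_factory=list)

def identify_claims_and_evidence(documents: List[str]) -> List[Claim]:
    """Task 1 (ED): extract claims and attach candidate evidence sentences."""
    claims = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        if sentences:
            # Stub heuristic: first sentence is the claim, the rest evidence.
            claims.append(Claim(text=sentences[0], evidence=sentences[1:]))
    return claims

def rank_evidence(claim: Claim) -> List[str]:
    """Task 2 (ECR): rank evidence by convincingness (stub: by length)."""
    return sorted(claim.evidence, key=len, reverse=True)

def summarise_into_essay(claims: List[Claim]) -> str:
    """Task 3 (ASR): compose an argumentative essay from ranked material."""
    parts = []
    for claim in claims:
        ranked = rank_evidence(claim)
        parts.append(claim.text + (": " + ranked[0] if ranked else ""))
    return ". ".join(parts) + "."

def score_essay_quality(essay: str) -> float:
    """Task 4 (SQE): a learned quality metric would replace this stub."""
    return min(1.0, len(essay.split()) / 100.0)

if __name__ == "__main__":
    docs = ["School uniforms improve focus. Studies report fewer distractions"]
    essay = summarise_into_essay(identify_claims_and_evidence(docs))
    print(essay, score_essay_quality(essay))
```

Chaining the stages this way is exactly where the paper reports degradation: errors from each stage's real model propagate into the next task, which is why end-to-end performance deteriorates even when each task looks promising in isolation.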
Related papers
- Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks
This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT.
We then describe a novel dataset containing a set of diverse arguments, ArGPT.
We show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
arXiv Detail & Related papers (2024-06-21T13:27:10Z)
- Exploring the Potential of Large Language Models in Computational Argumentation
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language.
This work assesses LLMs such as ChatGPT, Flan models, and LLaMA2 models in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z)
- UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization
UniSumm is a unified few-shot summarization model pre-trained with multiple summarization tasks.
SummZoo is a new benchmark to better evaluate few-shot summarizers.
arXiv Detail & Related papers (2022-11-17T18:54:47Z)
- Full-Text Argumentation Mining on Scientific Publications
We introduce a sequential pipeline model combining argumentative discourse unit recognition (ADUR) and argumentative relation extraction (ARE) for full-text scientific argumentation mining (SAM).
We provide a first analysis of the performance of pretrained language models (PLMs) on both subtasks.
Our detailed error analysis reveals that non-contiguous ADUs as well as the interpretation of discourse connectors pose major challenges.
arXiv Detail & Related papers (2022-10-24T10:05:30Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- Diversity Over Size: On the Effect of Sample and Topic Sizes for Topic-Dependent Argument Mining Datasets
We investigate the effect of Argument Mining dataset composition in few- and zero-shot settings.
Our findings show that, while fine-tuning is mandatory to achieve acceptable model performance, using carefully composed training samples and reducing the training sample size by almost 90% can still yield 95% of the maximum performance.
arXiv Detail & Related papers (2022-05-23T17:14:32Z)
- Automated Evaluation for Student Argumentative Writing: A Survey
This paper surveys and organizes research works in an under-studied area, which we call automated evaluation for student argumentative writing.
Unlike traditional automated writing evaluation that focuses on holistic essay scoring, this field is more specific: it focuses on evaluating argumentative essays and offers specific feedback.
arXiv Detail & Related papers (2022-05-09T07:27:59Z)
- IAM: A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks
In this work, we introduce a comprehensive and large dataset named IAM, which can be applied to a series of argument mining tasks.
Nearly 70k sentences in the dataset are fully annotated based on their argument properties.
We propose two new integrated argument mining tasks associated with the debate preparation process: (1) claim extraction with stance classification (CESC) and (2) claim-evidence pair extraction (CEPE).
arXiv Detail & Related papers (2022-03-23T08:07:32Z)