5W1H Extraction With Large Language Models
- URL: http://arxiv.org/abs/2405.16150v1
- Date: Sat, 25 May 2024 09:42:58 GMT
- Title: 5W1H Extraction With Large Language Models
- Authors: Yang Cao, Yangsong Lan, Feiyan Zhai, Piji Li
- Abstract summary: The extraction of essential news elements through the 5W1H framework is critical for event extraction and text summarization.
ChatGPT has encountered challenges in processing longer news texts and analyzing specific attributes in context.
We design several strategies, from zero-shot/few-shot prompting to efficient fine-tuning, to extract 5W1H aspects from the original news documents.
- Score: 27.409473072672277
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The extraction of essential news elements through the 5W1H framework (What, When, Where, Why, Who, and How) is critical for event extraction and text summarization. The advent of large language models (LLMs) such as ChatGPT presents an opportunity to address language-related tasks through simple prompts, without time-consuming model fine-tuning. However, ChatGPT struggles to process longer news texts and to analyze specific attributes in context, especially when answering questions about What, Why, and How. The effectiveness of extraction tasks also depends heavily on high-quality human-annotated datasets, and the absence of such a dataset for 5W1H extraction complicates fine-tuning strategies based on open-source LLMs. To address these limitations, we first annotate a high-quality 5W1H dataset based on four typical news corpora (CNN/DailyMail, XSum, NYT, RA-MDS); second, we design several strategies, from zero-shot/few-shot prompting to efficient fine-tuning, to extract the 5W1H aspects from the original news documents. The experimental results demonstrate that models fine-tuned on our labelled dataset outperform ChatGPT. Furthermore, we explore domain adaptation by testing source-domain models (e.g., NYT) on a target-domain corpus (e.g., CNN/DailyMail) for the 5W1H extraction task.
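As a concrete illustration of the zero-shot prompting strategy described in the abstract, the sketch below asks a ChatGPT-style model for the six elements of a news article. This is a minimal sketch, not the authors' actual template: the prompt wording, model choice, and output format are assumptions.

```python
# Minimal zero-shot 5W1H extraction sketch (prompt wording is illustrative,
# not the paper's exact template).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Read the news article below and extract its 5W1H elements.\n"
    "Answer with one line per element: 'What: ...', 'When: ...', 'Where: ...', "
    "'Why: ...', 'Who: ...', 'How: ...'. "
    "If an element is not stated in the article, answer 'not mentioned'.\n\n"
    "Article:\n{article}"
)

def extract_5w1h(article: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding suits extraction
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
    )
    return response.choices[0].message.content
```

A few-shot variant would prepend one or two annotated example articles to the same prompt; the paper's fine-tuning strategies instead train open-source LLMs directly on the annotated dataset.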
Related papers
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created, human-annotated dataset consisting of coherent summaries for five publicly available datasets, together with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% ROUGE-L) in producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning [64.55339431760727]
We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M.
Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs.
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks.
arXiv Detail & Related papers (2024-04-19T11:38:08Z)
- Hypertext Entity Extraction in Webpage [112.56734676713721]
We introduce a MoE-based Entity Extraction Framework (MoEEF), which integrates multiple features to enhance model performance.
We also analyze the effectiveness of hypertext features in HEED and several model components in MoEEF.
arXiv Detail & Related papers (2024-03-04T03:21:40Z)
- Empower Text-Attributed Graphs Learning with Large Language Models (LLMs) [5.920353954082262]
We propose a plug-and-play approach to empower text-attributed graphs through node generation using Large Language Models (LLMs).
We employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph.
Experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios.
arXiv Detail & Related papers (2023-10-15T16:04:28Z)
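The node-generation idea in the entry above lends itself to a short sketch: synthesize node texts with an LLM, then attach the new nodes with an edge predictor. Everything here is an assumption for illustration; in particular, TF-IDF cosine similarity stands in for the paper's learned edge predictor, and the generated texts are taken as given rather than produced by an actual LLM call.

```python
# Speculative sketch: integrate LLM-generated node texts into a
# text-attributed graph via a similarity-based edge-predictor stand-in.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def integrate_generated_nodes(graph: nx.Graph, generated_texts: list, threshold=0.3):
    """Each node of `graph` is expected to carry a 'text' attribute."""
    old_nodes = list(graph.nodes)
    old_texts = [graph.nodes[n]["text"] for n in old_nodes]
    vec = TfidfVectorizer().fit(old_texts + generated_texts)
    sims = cosine_similarity(
        vec.transform(generated_texts), vec.transform(old_texts)
    )
    for i, text in enumerate(generated_texts):
        new = f"gen_{i}"
        graph.add_node(new, text=text)
        for j, old in enumerate(old_nodes):
            if sims[i, j] >= threshold:  # edge-predictor stand-in
                graph.add_edge(new, old)
    return graph
```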
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From Texts [6.599344783327053]
Verint Intent Manager (VIM) is an analysis platform that combines unsupervised and semi-supervised approaches to help analysts quickly surface and organize relevant user intentions from conversational texts.
For the initial exploration of data, we use a novel unsupervised and semi-supervised pipeline that integrates the fine-tuning of high-performing language models.
BERT produces better task-aware representations using a labeled subset as small as 0.5% of the task data.
arXiv Detail & Related papers (2022-02-01T23:01:05Z)
- A Flexible Clustering Pipeline for Mining Text Intentions [6.599344783327053]
We create a flexible and scalable clustering pipeline within the Verint Intent Manager.
It integrates the fine-tuning of language models, a high-performing k-NN library, and community detection techniques.
As deployed in the VIM application, this clustering pipeline produces high quality results.
arXiv Detail & Related papers (2022-02-01T22:54:18Z)
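The pipeline in the entry above (embeddings, k-NN, community detection) can be sketched compactly. This is a hedged approximation: the paper fine-tunes language models for the embedding step, while the sketch uses TF-IDF so it stays self-contained, and the specific k-NN implementation and community-detection algorithm are stand-ins.

```python
# Sketch of a k-NN + community-detection text clustering pipeline.
# TF-IDF stands in for fine-tuned language-model embeddings.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def cluster_texts(texts, k=5):
    vectors = TfidfVectorizer().fit_transform(texts)
    knn = NearestNeighbors(n_neighbors=min(k + 1, len(texts))).fit(vectors)
    _, indices = knn.kneighbors(vectors)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(texts)))
    for i, neighbors in enumerate(indices):
        for j in neighbors[1:]:  # position 0 is the self-match; skip it
            graph.add_edge(i, int(j))
    return [sorted(c) for c in greedy_modularity_communities(graph)]
```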
- TopNet: Learning from Neural Topic Model to Generate Long Stories [43.5564336855688]
Long story generation (LSG) is one of the coveted goals in natural language processing.
We propose TopNet to obtain high-quality skeleton words to complement the short input.
Our proposed framework is highly effective in skeleton word selection and significantly outperforms state-of-the-art models in both automatic evaluation and human evaluation.
arXiv Detail & Related papers (2021-12-14T09:47:53Z)
- HTLM: Hyper-Text Pre-Training and Prompting of Language Models [52.32659647159799]
We introduce HTLM, a hyper-text language model trained on a large-scale web crawl.
We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels.
We find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs.
arXiv Detail & Related papers (2021-07-14T19:39:31Z)
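Hyper-text prompting as described in the HTLM entry above can be approximated with a BART-style infilling call: wrap the document in simplified HTML and let the model fill a masked element. This sketch uses facebook/bart-large as a stand-in checkpoint; the HTML template and decoding settings are assumptions, not the paper's setup.

```python
# Sketch of an HTLM-style hyper-text prompt: infill a masked <title> element
# as a summary of the page body. facebook/bart-large stands in for HTLM.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def infill_title(body_text: str) -> str:
    prompt = f"<html><title><mask></title><body>{body_text}</body></html>"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output = model.generate(**inputs, num_beams=4, max_length=256)
    # BART reconstructs the whole document; the infilled <title> span
    # would then be extracted from the decoded string.
    return tokenizer.decode(output[0], skip_special_tokens=True)
```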
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift in the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
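The matching formulation in the last entry reduces to scoring candidate summaries against the source document in a shared semantic space. A minimal sketch follows; the paper matches with Siamese-BERT embeddings over pruned candidate sets, whereas this version uses TF-IDF and brute-force enumeration purely for illustration.

```python
# Sketch of extractive summarization as text matching: pick the candidate
# sentence subset most similar to the full document. TF-IDF stands in for
# the paper's Siamese-BERT matching model.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_extractive_summary(sentences, summary_len=3):
    document = " ".join(sentences)
    candidates = [" ".join(c) for c in combinations(sentences, summary_len)]
    vectorizer = TfidfVectorizer().fit([document] + candidates)
    doc_vec = vectorizer.transform([document])
    cand_vecs = vectorizer.transform(candidates)
    scores = cosine_similarity(cand_vecs, doc_vec).ravel()
    return candidates[int(scores.argmax())]
```

Brute-force enumeration is exponential in document length, which is exactly why the original work prunes candidates with a sentence-level extractor first.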