WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs
- URL: http://arxiv.org/abs/2209.13101v1
- Date: Tue, 27 Sep 2022 01:28:02 GMT
- Title: WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs
- Authors: Hoang Thang Ta, Abu Bakar Siddiqur Rahman, Navonil Majumder, Amir
Hussain, Lotfollah Najjar, Newton Howard, Soujanya Poria and Alexander
Gelbukh
- Abstract summary: We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
- Score: 66.88232442007062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As free online encyclopedias with massive volumes of content, Wikipedia and
Wikidata are key to many Natural Language Processing (NLP) tasks, such as
information retrieval, knowledge base building, machine translation, text
classification, and text summarization. In this paper, we introduce WikiDes, a
novel dataset to generate short descriptions of Wikipedia articles for the
problem of text summarization. The dataset consists of over 80k English samples
on 6987 topics. We set up a two-phase summarization method - description
generation (Phase I) and candidate ranking (Phase II) - as a strong approach
that relies on transfer and contrastive learning. For description generation,
T5 and BART show their superiority compared to other small-scale pre-trained
models. By applying contrastive learning to the diverse candidates produced by
beam search, the metric fusion-based ranking models significantly outperform
the direct description generation models, by up to 22 ROUGE points on both the
topic-exclusive and topic-independent splits. Furthermore, in human evaluation
against the gold descriptions, the Phase II descriptions were chosen in 45.33%
of cases, compared to 23.66% for Phase I. With respect to sentiment analysis,
the generated descriptions do not capture all sentiment polarities of the
source paragraphs, although they reflect the sentiment of the gold
descriptions more effectively. The automatic generation of new descriptions
reduces the human effort of writing them and enriches Wikidata-based knowledge
graphs. Our paper shows a practical impact on Wikipedia and Wikidata, since
thousands of descriptions are missing. Finally, we expect WikiDes to be a
useful dataset for related work on capturing salient information from short
paragraphs. The curated dataset is publicly available at:
https://github.com/declare-lab/WikiDes.
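Below is a minimal Python sketch of the two-phase approach described in the
abstract: Phase I generates several candidate descriptions from a paragraph
with beam search, and Phase II re-ranks the candidates and keeps the best one.
The checkpoint name and the overlap-based scorer are illustrative placeholders
only; the paper fine-tunes its own T5/BART models and trains a contrastive,
metric fusion-based ranker, neither of which is reproduced here.

# Minimal sketch of the two-phase pipeline: Phase I generates candidate
# descriptions with beam search; Phase II re-ranks them.
# The checkpoint and the scoring heuristic are placeholders, not the
# authors' released models.
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-base"  # placeholder; WikiDes fine-tunes its own checkpoints
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_candidates(paragraph: str, num_beams: int = 8, k: int = 8) -> list[str]:
    """Phase I: produce k candidate descriptions via beam search."""
    inputs = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=k,
        max_length=32,          # short descriptions are only a few words long
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def rank_candidates(paragraph: str, candidates: list[str]) -> str:
    """Phase II (stand-in): score each candidate and return the best one.
    The paper trains a contrastive, metric fusion-based ranker; the simple
    token-overlap score below only illustrates the ranking interface."""
    para_tokens = set(paragraph.lower().split())
    def overlap(candidate: str) -> float:
        cand_tokens = set(candidate.lower().split())
        return len(cand_tokens & para_tokens) / max(len(cand_tokens), 1)
    return max(candidates, key=overlap)

if __name__ == "__main__":
    paragraph = "The first paragraph of a Wikipedia article goes here."
    candidates = generate_candidates(paragraph)
    print(rank_candidates(paragraph, candidates))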
Related papers
- WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural
Language Instruction [56.196512595940334]
We build and release WikiIns, a high-quality controlled text editing dataset with improved informativeness.
With the high-quality annotated dataset, we propose automatic approaches to generate a large-scale "silver" training set.
arXiv Detail & Related papers (2023-10-08T04:46:39Z)
- WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in
Wikipedia [14.325320851640084]
We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4 M sentences with 153 quality labels.
arXiv Detail & Related papers (2023-05-10T06:45:13Z)
- XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation
in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- TWAG: A Topic-Guided Wikipedia Abstract Generator [23.937804531845938]
Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has achieved significant success.
Previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics.
We propose a two-stage model TWAG that guides the abstract generation with topical information.
arXiv Detail & Related papers (2021-06-29T07:42:08Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Learning to Summarize Passages: Mining Passage-Summary Pairs from
Wikipedia Revision Histories [110.54963847339775]
We propose a method for automatically constructing a passage-to-summary dataset by mining the Wikipedia page revision histories.
In particular, the method mines the main body passages and the introduction sentences which are added to the pages simultaneously.
The constructed dataset contains more than one hundred thousand passage-summary pairs.
arXiv Detail & Related papers (2020-04-06T12:11:50Z)