OASum: Large-Scale Open Domain Aspect-based Summarization
- URL: http://arxiv.org/abs/2212.09233v2
- Date: Thu, 25 May 2023 22:29:45 GMT
- Title: OASum: Large-Scale Open Domain Aspect-based Summarization
- Authors: Xianjun Yang, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Xiaoman Pan,
Linda Petzold, Dong Yu
- Abstract summary: We take advantage of crowd-sourced knowledge on Wikipedia.org to automatically create a high-quality, large-scale aspect-based summarization dataset named OASum.
OASum contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.
To overcome the data scarcity problem in specific domains, we also conduct zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets.
- Score: 29.45232847592956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aspect- or query-based summarization has recently attracted increasing attention, as it can generate differentiated summaries tailored to users' interests. However, current datasets for aspect- or query-based summarization either focus on specific domains, contain relatively few instances, or include only a few aspect types. Such limitations hinder further exploration in this direction. In this work, we take advantage of crowd-sourced knowledge on
Wikipedia.org and automatically create a high-quality, large-scale open-domain
aspect-based summarization dataset named OASum, which contains more than 3.7
million instances with around 1 million different aspects on 2 million
Wikipedia pages. We provide benchmark results on OASum and demonstrate its utility for diverse aspect-based summary generation. To overcome the data scarcity problem in specific domains, we also conduct zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets. The zero/few-shot and fine-tuning results show that a model pre-trained on our corpus exhibits a stronger aspect- or query-focused generation ability than the backbone model. Our dataset and pre-trained checkpoints are publicly available.
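The abstract describes deriving aspect-based instances from Wikipedia's crowd-sourced structure. As a minimal illustrative sketch (not the authors' actual pipeline, whose details are in the paper), one can pair a full article with each of its section headings, treating the heading as the aspect and the section text as a stand-in for the aspect-specific reference:

```python
# Illustrative sketch only: build (document, aspect, reference) triples
# from a sectioned Wikipedia-style article, using section titles as
# aspects. The dict layout here is a hypothetical example format.

def build_instances(article):
    """Turn {"title": ..., "sections": [(heading, text), ...]} into
    a list of aspect-based summarization instances.

    The concatenated article body serves as the input document; each
    section heading is an aspect, and that section's text acts as the
    aspect-specific reference."""
    document = " ".join(text for _, text in article["sections"])
    return [
        {"document": document, "aspect": heading, "reference": text}
        for heading, text in article["sections"]
    ]

article = {
    "title": "Example Page",
    "sections": [
        ("Early life", "Born in 1900, the subject grew up in a small town."),
        ("Career", "They published several influential papers."),
    ],
}
instances = build_instances(article)
```

Each instance then asks a model to summarize the shared document with respect to one aspect, which is the task format benchmarked on OASum.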
Related papers
- Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework [15.991114464911844]
In the past five years, no large-scale dataset has been opened to the public.
This paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, MSP60K.
It consists of 60,122 images and 57 attribute annotations across eight scenarios.
arXiv Detail & Related papers (2024-08-19T06:19:31Z)
- Wiki Entity Summarization Benchmark [9.25319552487389]
Entity summarization aims to compute concise summaries for entities in knowledge graphs.
Existing datasets and benchmarks are often limited to a few hundred entities.
We propose WikES, a comprehensive benchmark comprising entities, their summaries, and their connections.
arXiv Detail & Related papers (2024-06-12T17:22:00Z)
- ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
- Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)
- LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS.
We establish baselines with state-of-the-art summarization models.
We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z)
- Combining Data Generation and Active Learning for Low-Resource Question Answering [23.755283239897132]
We propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings.
Our findings show that this approach, which incorporates humans into the data generation loop, boosts performance in the low-resource, domain-specific setting.
arXiv Detail & Related papers (2022-11-27T16:31:33Z)
- Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations [80.76164484820818]
There is an inescapable long-tailed class-imbalance issue in many real-world classification problems.
We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains.
Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another.
arXiv Detail & Related papers (2022-10-25T21:54:26Z)
- Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples.
We show that a few-shot method based on adapters can easily store in-domain knowledge.
We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the influence of model architecture and generation approach.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.