OASum: Large-Scale Open Domain Aspect-based Summarization
- URL: http://arxiv.org/abs/2212.09233v2
- Date: Thu, 25 May 2023 22:29:45 GMT
- Title: OASum: Large-Scale Open Domain Aspect-based Summarization
- Authors: Xianjun Yang, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Xiaoman Pan,
Linda Petzold, Dong Yu
- Abstract summary: We take advantage of crowd-sourcing knowledge on Wikipedia.org and automatically create a high-quality, large-scale aspect-based summarization dataset named OASum.
OASum contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.
To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets.
- Score: 29.45232847592956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aspect or query-based summarization has recently caught more attention, as it
can generate differentiated summaries based on users' interests. However, the
current dataset for aspect or query-based summarization either focuses on
specific domains, contains relatively small-scale instances, or includes only a
few aspect types. Such limitations hinder further explorations in this
direction. In this work, we take advantage of crowd-sourcing knowledge on
Wikipedia.org and automatically create a high-quality, large-scale open-domain
aspect-based summarization dataset named OASum, which contains more than 3.7
million instances with around 1 million different aspects on 2 million
Wikipedia pages. We provide benchmark results on OASum and demonstrate its
ability for diverse aspect-based summarization generation. To overcome the data
scarcity problem on specific domains, we also perform zero-shot, few-shot, and
fine-tuning on seven downstream datasets. Specifically, zero/few-shot and
fine-tuning results show that the model pre-trained on our corpus demonstrates
a strong aspect or query-focused generation ability compared with the backbone
model. Our dataset and pre-trained checkpoints are publicly available.
Related papers
- ACLSum: A New Dataset for Aspect-based Summarization of Scientific
Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z) - OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization [19.079053035229695]
We introduce OpenAsp, a benchmark for aspect-based summarization.
We show that the realistic open-aspect setting realized in OpenAsp poses a challenge for current state-of-the-art summarization models.
arXiv Detail & Related papers (2023-12-07T17:06:20Z) - LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS.
We establish baselines with state-of-the-art summarization models.
We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z) - Multi-Domain Long-Tailed Learning by Augmenting Disentangled
Representations [80.76164484820818]
There is an inescapable long-tailed class-imbalance issue in many real-world classification problems.
We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains.
Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another.
arXiv Detail & Related papers (2022-10-25T21:54:26Z) - Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples.
We show that a few-shot method based on adapters can easily store in-domain knowledge.
We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z) - How well do you know your summarization datasets? [11.992125069326772]
We analyze 600 samples from three popular summarization datasets.
We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics.
arXiv Detail & Related papers (2021-06-21T19:44:06Z) - Cross-domain Time Series Forecasting with Attention Sharing [10.180248006928107]
We propose a novel domain adaptation framework,Domain Adaptation Forecaster (DAF), to cope with the issue of data scarcity.
In particular, we pro-pose an attention-based shared module with a do-main discriminator across domains as well as pri-vate modules for individual domains.
This allowsus to jointly train the source and target domains bygenerating domain-invariant latent features whileretraining domain-specific features.
arXiv Detail & Related papers (2021-02-13T00:26:35Z) - Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system.
We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries.
Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z) - CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural
Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.