Related papers: OASum: Large-Scale Open Domain Aspect-based Summarization

OASum: Large-Scale Open Domain Aspect-based Summarization

URL: http://arxiv.org/abs/2212.09233v2
Date: Thu, 25 May 2023 22:29:45 GMT
Title: OASum: Large-Scale Open Domain Aspect-based Summarization
Authors: Xianjun Yang, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Xiaoman Pan, Linda Petzold, Dong Yu
Abstract summary: We take advantage of crowd-sourcing knowledge on Wikipedia.org and automatically create a high-quality, large-scale aspect-based summarization dataset named OASum. OASum contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets.
Score: 29.45232847592956
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users' interests. However, the current dataset for aspect or query-based summarization either focuses on specific domains, contains relatively small-scale instances, or includes only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourcing knowledge on Wikipedia.org and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.

Related papers

Enhancing Dataset Distillation via Non-Critical Region Refinement [29.858754062202213]
We introduce the Non-Critical Region Refinement dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data. We also present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets.
arXiv Detail & Related papers (2025-03-24T01:20:22Z)
Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework [15.991114464911844]
In the past five years, no large-scale dataset has been opened to the public. This paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios.
arXiv Detail & Related papers (2024-08-19T06:19:31Z)
Wiki Entity Summarization Benchmark [9.25319552487389]
Entity summarization aims to compute concise summaries for entities in knowledge graphs. Existing datasets and benchmarks are often limited to a few hundred entities. We propose WikES, a comprehensive benchmark comprising of entities, their summaries, and their connections.
arXiv Detail & Related papers (2024-06-12T17:22:00Z)
ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts. In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)
LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS. We establish baselines with state-of-the-art summarization models. We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z)
Combining Data Generation and Active Learning for Low-Resource Question Answering [23.755283239897132]
We propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings. Our findings show that our novel approach, where humans are incorporated in a data generation approach, boosts performance in the low-resource, domain-specific setting.
arXiv Detail & Related papers (2022-11-27T16:31:33Z)
Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations [80.76164484820818]
There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another.
arXiv Detail & Related papers (2022-10-25T21:54:26Z)
Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. We show that a few-shot method based on adapters can easily store in-domain knowledge. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting. A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.