Topic Segmentation of Research Article Collections
- URL: http://arxiv.org/abs/2205.11249v1
- Date: Wed, 18 May 2022 15:19:42 GMT
- Title: Topic Segmentation of Research Article Collections
- Authors: Erion Çano and Benjamin Roth
- Abstract summary: We perform topic segmentation of a paper data collection that we crawled and produce a multitopic dataset of roughly seven million paper data records.
We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy.
It is possible to use this newly proposed dataset in two modalities: as a heterogeneous collection of documents from various disciplines or as a set of homogeneous collections, each from a single research topic.
- Score: 4.0810783261728565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Collections of research article data harvested from the web have become
common recently since they are important resources for experimenting on tasks
such as named entity recognition, text summarization, or keyword generation. In
fact, certain types of experiments require collections that are both large and
topically structured, with records assigned to separate research disciplines.
Unfortunately, the current collections of publicly available research articles
are either small or heterogeneous and unstructured. In this work, we perform
topic segmentation of a paper data collection that we crawled and produce a
multitopic dataset of roughly seven million paper data records. We construct a
taxonomy of topics extracted from the data records and then annotate each
document with its corresponding topic from that taxonomy. As a result, it is
possible to use this newly proposed dataset in two modalities: as a
heterogeneous collection of documents from various disciplines or as a set of
homogeneous collections, each from a single research topic.
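The two usage modalities described in the abstract can be illustrated with a minimal sketch. It assumes each record carries a single topic label; the field names `title` and `topic`, the sample records, and the helper `split_by_topic` are hypothetical and do not reflect the actual schema of the published dataset.

```python
from collections import defaultdict

# Hypothetical records: the field names "title" and "topic" and the topic
# labels are assumptions, not the schema of the published dataset.
records = [
    {"title": "Paper A", "topic": "Computer Science"},
    {"title": "Paper B", "topic": "Medicine"},
    {"title": "Paper C", "topic": "Computer Science"},
]


def split_by_topic(records):
    """Group records into one homogeneous collection per topic label."""
    collections = defaultdict(list)
    for record in records:
        collections[record["topic"]].append(record)
    return dict(collections)


# Modality 1: the heterogeneous collection is simply the full list.
heterogeneous = records

# Modality 2: one homogeneous collection per research topic.
homogeneous = split_by_topic(records)
```

Because every record has exactly one topic in this sketch, the homogeneous collections partition the heterogeneous one, so either view can be derived from the same annotated data.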
Related papers
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking the documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Generating a Structured Summary of Numerous Academic Papers: Dataset and Method [20.90939310713561]
We propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic.
We collect target summaries from more than seven thousand survey papers and utilize their 430 thousand reference papers' abstracts as input documents.
To organize the diverse content from dozens of input documents, we propose a summarization method named category-based alignment and sparse transformer (CAST).
arXiv Detail & Related papers (2023-02-09T11:42:07Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Topic Taxonomy Expansion via Hierarchy-Aware Topic Phrase Generation [58.3921103230647]
We propose a novel framework for topic taxonomy expansion, named TopicExpan.
TopicExpan directly generates topic-related terms belonging to new topics.
Experimental results on two real-world text corpora show that TopicExpan significantly outperforms other baseline methods in terms of the quality of output.
arXiv Detail & Related papers (2022-10-18T22:38:49Z)
- TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters [57.59286394188025]
We propose a novel framework for topic taxonomy completion, named TaxoCom.
TaxoCom discovers novel sub-topic clusters of terms and documents.
Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom generates a high-quality topic taxonomy in terms of term coherency and topic coverage.
arXiv Detail & Related papers (2022-01-18T07:07:38Z)
- CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example [43.01717754418893]
We introduce the task of faceted Query by Example.
Users can also specify a finer-grained aspect in addition to the input query document.
We envision models which are able to retrieve scientific papers analogous to a query scientific paper.
arXiv Detail & Related papers (2021-03-24T01:02:12Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.