Interactive Distillation of Large Single-Topic Corpora of Scientific
Papers
- URL: http://arxiv.org/abs/2309.10772v1
- Date: Tue, 19 Sep 2023 17:18:36 GMT
- Title: Interactive Distillation of Large Single-Topic Corpora of Scientific
Papers
- Authors: Nicholas Solovyev, Ryan Barron, Manish Bhattarai, Maksim E. Eren, Kim
O. Rasmussen, Boian S. Alexandrov
- Abstract summary: A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert handpicks documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
- Score: 1.2954493726326113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Highly specific datasets of scientific literature are important for both
research and education. However, it is difficult to build such datasets at
scale. A common approach is to build these datasets reductively by applying
topic modeling on an established corpus and selecting specific topics. A more
robust but time-consuming approach is to build the dataset constructively in
which a subject matter expert (SME) handpicks documents. This method does not
scale and is prone to error as the dataset grows. Here we showcase a new tool,
based on machine learning, for constructively generating targeted datasets of
scientific literature. Given a small initial "core" corpus of papers, we build
a citation network of documents. At each step of the citation network, we
generate text embeddings and visualize the embeddings through dimensionality
reduction. Papers are kept in the dataset if they are "similar" to the core or
are otherwise pruned through human-in-the-loop selection. Additional insight
into the papers is gained through sub-topic modeling using SeNMFk. We
demonstrate our new tool for literature review by applying it to two different
fields in machine learning.
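The expansion-and-pruning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' tool: the function names (`cosine`, `expand_once`), the toy two-dimensional "embeddings", and the fixed similarity `threshold` are all assumptions made for the example. In particular, the threshold stands in for the human-in-the-loop selection, and the dimensionality-reduction and visualization steps are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_once(core_ids, citations, embeddings, threshold):
    """One hypothetical expansion step over the citation network.

    Collects papers that cite, or are cited by, a core paper, then keeps
    those whose embedding is close to the centroid of the core embeddings.
    """
    core = set(core_ids)
    candidates = set()
    for src, dst in citations:  # each pair means "src cites dst"
        if src in core and dst not in core:
            candidates.add(dst)
        if dst in core and src not in core:
            candidates.add(src)

    # Centroid of the core embeddings (component-wise mean).
    core_vecs = [embeddings[i] for i in sorted(core)]
    centroid = [sum(col) / len(core_vecs) for col in zip(*core_vecs)]

    kept, pruned = [], []
    for paper in sorted(candidates):
        # The threshold is an assumption replacing the expert's manual choice.
        if cosine(embeddings[paper], centroid) >= threshold:
            kept.append(paper)
        else:
            pruned.append(paper)
    return kept, pruned

# Toy data: two core papers (A, B) and three others.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.8, 0.3],   # close to the core
    "D": [0.0, 1.0],   # far from the core
    "E": [0.7, 0.2],   # not linked to the core, so never considered
}
citations = [("A", "C"), ("D", "B"), ("E", "C")]

kept, pruned = expand_once(["A", "B"], citations, embeddings, threshold=0.8)
print("kept:", kept)
print("pruned:", pruned)
```

In a real pipeline the kept papers would be added to the core before the next expansion step, and the embeddings would come from a text-embedding model rather than hand-written vectors.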
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
(2024-10-28T15:56:49Z)
- MatViX: Multimodal Information Extraction from Visually Rich Articles [6.349779979863784]
In materials science, extracting structured information from research articles can accelerate the discovery of new materials.
We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured files.
These files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE.
(2024-10-27T16:13:58Z)
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
(2024-10-08T17:02:40Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
(2024-10-02T20:48:28Z)
- Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models [0.0]
We propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations.
We produce a new, large dataset (PMOA-CITE) based on PubMed Open Access Subset, which is orders of magnitude larger than previous datasets.
(2024-05-20T17:45:36Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
(2022-12-19T16:10:56Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
(2021-02-23T04:14:34Z)
- Method and Dataset Entity Mining in Scientific Literature: A CNN + Bi-LSTM Model with Self-attention [21.93889297841459]
We propose a novel entity recognition model, called MDER, which is able to effectively extract the method and dataset entities from scientific papers.
We evaluate the proposed model on datasets constructed from the published papers of four research areas in computer science, i.e., NLP, CV, Data Mining and AI.
(2020-10-26T13:38:43Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
(2020-10-03T03:18:52Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
(2020-05-20T19:12:24Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
(2020-05-20T14:33:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.