The Open Review-Based (ORB) dataset: Towards Automatic Assessment of
Scientific Papers and Experiment Proposals in High-Energy Physics
- URL: http://arxiv.org/abs/2312.04576v1
- Date: Wed, 29 Nov 2023 20:52:02 GMT
- Title: The Open Review-Based (ORB) dataset: Towards Automatic Assessment of
Scientific Papers and Experiment Proposals in High-Energy Physics
- Authors: Jaroslaw Szumega, Lamine Bougueroua, Blerina Gkotse, Pierre Jouvelot,
Federico Ravotti
- Abstract summary: We introduce the new comprehensive Open Review-Based dataset (ORB).
It includes a curated list of more than 36,000 scientific papers, together with more than 89,000 associated reviews and final decisions.
This paper presents our data architecture and an overview of the collected data along with relevant statistics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the Open Science approach becoming important for research, the evolution
towards open scientific-paper reviews is making an impact on the scientific
community. However, there is a lack of publicly available resources for
conducting research activities related to this subject, as only a limited
number of journals and conferences currently allow access to their review
process for interested parties. In this paper, we introduce the new
comprehensive Open Review-Based dataset (ORB); it includes a curated list of
more than 36,000 scientific papers, together with more than 89,000 associated
reviews and final decisions. We gather this information from two sources: the
OpenReview.net and SciPost.org websites. However, given the volatile nature of
this domain, the software infrastructure that we introduce to supplement the
ORB dataset is designed to accommodate additional resources in the future. The
ORB deliverables include (1) Python code (interfaces and implementations) to
translate document data and metadata into a structured and high-level
representation, (2) an ETL process (Extract, Transform, Load) to facilitate
automatic updates from the defined sources, and (3) data files representing the
structured data. The paper presents our data architecture and an overview of
the collected data along with relevant statistics. For illustration purposes,
we also discuss preliminary Natural-Language-Processing-based experiments that
aim to predict (1) papers' acceptance based on their textual embeddings and
(2) grading statistics, also inferred from these embeddings. We believe ORB
provides a valuable resource for researchers interested in open science and
review, with our implementation easing the use of this data for further
analysis and experimentation. We plan to update ORB as the field matures and
to introduce new resources better tailored to specific scientific domains such
as High-Energy Physics.
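
The deliverables above mention Python interfaces that translate document data and metadata into a structured, high-level representation, together with an ETL process for automatic updates. As a rough sketch of what such a representation and its load step might look like, here is a minimal example; all class, field, and function names (Review, PaperRecord, load_records) are hypothetical illustrations, not ORB's actual API:

```python
# Minimal sketch of a structured paper/review representation and the load step
# of an ETL-style update. All names here are hypothetical, not ORB's interfaces.
from dataclasses import dataclass, field
from typing import List, Optional
import json


@dataclass
class Review:
    reviewer_id: str            # anonymised reviewer identifier
    text: str                   # full review text
    grade: Optional[float]      # numeric grade, if the venue provides one


@dataclass
class PaperRecord:
    source: str                 # e.g. "openreview" or "scipost"
    title: str
    abstract: str
    reviews: List[Review] = field(default_factory=list)
    decision: Optional[str] = None   # e.g. "accept" / "reject"


def load_records(path: str) -> List[PaperRecord]:
    """Load step: read a JSON-lines file of already-transformed records
    into the high-level representation."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            raw = json.loads(line)
            reviews = [Review(**r) for r in raw.get("reviews", [])]
            records.append(PaperRecord(
                source=raw["source"],
                title=raw["title"],
                abstract=raw["abstract"],
                reviews=reviews,
                decision=raw.get("decision"),
            ))
    return records
```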
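The abstract also mentions preliminary NLP experiments that predict acceptance from textual embeddings. A minimal sketch of such a baseline follows, assuming precomputed embeddings and scikit-learn; the embedding model and classifier choice are assumptions, not the paper's reported setup:

```python
# Hedged sketch of an embedding-based acceptance classifier, in the spirit of
# the preliminary experiments described in the abstract. The classifier and
# split choices here are assumptions, not the paper's actual configuration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def acceptance_baseline(embeddings: np.ndarray, accepted: np.ndarray) -> float:
    """Train a linear probe on paper embeddings and report held-out accuracy.

    embeddings: (n_papers, dim) array of textual embeddings (e.g. of abstracts).
    accepted:   (n_papers,) array of 0/1 final decisions.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, accepted, test_size=0.2, random_state=0, stratify=accepted)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```

A linear probe of this kind is a common first check of how much decision-relevant signal the embeddings carry before trying heavier models.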
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents (arXiv, 2024-10-28)
  We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
  Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions (arXiv, 2024-10-02)
  Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
  We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
  We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature (arXiv, 2024-06-10)
  We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
  SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows (arXiv, 2024-06-10)
  We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
  MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
  We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
- Enriched BERT Embeddings for Scholarly Publication Classification (arXiv, 2024-05-07)
  The NSLP 2024 FoRC Task I addresses scholarly publication classification and is organized as a competition.
  The goal is to develop a classifier capable of predicting one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields for a given article.
- Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning (arXiv, 2024-04-04)
  We explore using large language models (LLMs) and prompting strategies to automatically extract dimensions from documents.
  Our approach could aid data publishers and practitioners in creating machine-readable documentation.
  We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction (arXiv, 2023-11-14)
  We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
  We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
- The Semantic Scholar Open Data Platform (arXiv, 2023-01-24)
  Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
  We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
  The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
- Citation Trajectory Prediction via Publication Influence Representation Using Temporal Knowledge Graph (arXiv, 2022-10-02)
  Existing approaches mainly rely on mining temporal and graph data from academic articles.
  Our framework is composed of three modules: difference-preserved graph embedding, fine-grained influence representation, and learning-based trajectory calculation.
  Experiments are conducted on both the APS academic dataset and our contributed AIPatent dataset.