Related papers: The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics

The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics

URL: http://arxiv.org/abs/2312.04576v1
Date: Wed, 29 Nov 2023 20:52:02 GMT
Title: The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics
Authors: Jaroslaw Szumega, Lamine Bougueroua, Blerina Gkotse, Pierre Jouvelot, Federico Ravotti
Abstract summary: We introduce the new comprehensive Open Review-Based dataset (ORB) It includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions. This paper presents our data architecture and an overview of the collected data along with relevant statistics.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the Open Science approach becoming important for research, the evolution towards open scientific-paper reviews is making an impact on the scientific community. However, there is a lack of publicly available resources for conducting research activities related to this subject, as only a limited number of journals and conferences currently allow access to their review process for interested parties. In this paper, we introduce the new comprehensive Open Review-Based dataset (ORB); it includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions. We gather this information from two sources: the OpenReview.net and SciPost.org websites. However, given the volatile nature of this domain, the software infrastructure that we introduce to supplement the ORB dataset is designed to accommodate additional resources in the future. The ORB deliverables include (1) Python code (interfaces and implementations) to translate document data and metadata into a structured and high-level representation, (2) an ETL process (Extract, Transform, Load) to facilitate the automatic updates from defined sources and (3) data files representing the structured data. The paper presents our data architecture and an overview of the collected data along with relevant statistics. For illustration purposes, we also discuss preliminary Natural-Language-Processing-based experiments that aim to predict (1) papers' acceptance based on their textual embeddings, and (2) grading statistics inferred from embeddings as well. We believe ORB provides a valuable resource for researchers interested in open science and review, with our implementation easing the use of this data for further analysis and experimentation. We plan to update ORB as the field matures as well as introduce new resources even more fitted to dedicated scientific domains such as High-Energy Physics.

Related papers

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks. SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
Enriched BERT Embeddings for Scholarly Publication Classification [0.13654846342364302]
The NSLP 2024 FoRC Task I addresses this challenge organized as a competition. The goal is to develop a classifier capable of predicting one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields for a given article.
arXiv Detail & Related papers (2024-05-07T09:05:20Z)
Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLM) and prompting strategies to automatically extract dimensions from documents. Our approach could aid data publishers and practitioners in creating machine-readable documentation. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z)
On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning [1.8270184406083445]
We examine a sample of 4041 data papers of different domains, assessing their completeness, coverage of the requested dimensions, and trends in recent years. We propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in machine learning technologies.
arXiv Detail & Related papers (2024-01-18T12:11:27Z)
All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
Citation Trajectory Prediction via Publication Influence Representation Using Temporal Knowledge Graph [52.07771598974385]
Existing approaches mainly rely on mining temporal and graph data from academic articles. Our framework is composed of three modules: difference-preserved graph embedding, fine-grained influence representation, and learning-based trajectory calculation. Experiments are conducted on both the APS academic dataset and our contributed AIPatent dataset.
arXiv Detail & Related papers (2022-10-02T07:43:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.