DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance
- URL: http://arxiv.org/abs/2202.10163v2
- Date: Thu, 24 Feb 2022 05:39:59 GMT
- Title: DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance
- Authors: Shao Zhang, Yuting Jia, Hui Xu, Ying Wen, Dakuo Wang, Xinbing Wang
- Abstract summary: Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
- Score: 48.55345030503826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geoscientists, as well as researchers in many fields, need to read a huge
amount of literature to locate, extract, and aggregate relevant results and
data to enable future research or to build a scientific database, but there is
no existing system to support this use case well. In this paper, based on the
findings of a formative study about how geoscientists collaboratively annotate
literature and extract and aggregate data, we propose DeepShovel, a
publicly available AI-assisted data extraction system to support their needs.
DeepShovel leverages state-of-the-art neural network models to help
researchers easily and accurately annotate papers (in PDF format) and
extract data from tables, figures, maps, etc. in a human-AI collaborative
manner. A follow-up user evaluation with 14 researchers suggested that
DeepShovel improved users' efficiency of data extraction for building
scientific databases and encouraged teams to form larger-scale but more
tightly coupled collaborations.
Related papers
- A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Conflating point of interest (POI) data: A systematic review of matching methods [5.439489511940086]
Point of interest (POI) data provide digital representations of places in the real world.
Many POI datasets have been developed, which often have different geographic coverages, attribute focuses, and data quality.
Researchers may need to conflate two or more POI datasets in order to build a better representation of the places in the study areas.
arXiv Detail & Related papers (2023-10-23T19:38:31Z)
- Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines [1.1999555634662633]
This work presents the design, implementation and application of a novel data extraction and exploration system.
We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variants plots with ranked, related entities.
Our system is publicly available on the web at https://cancercelllines.org.
arXiv Detail & Related papers (2023-07-03T11:15:42Z)
- GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature [1.4190701053683017]
We propose a natural language processing paradigm to support the human task of identifying informal mentions made to research datasets.
The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research.
arXiv Detail & Related papers (2022-03-10T02:11:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.