Nougat: Neural Optical Understanding for Academic Documents
- URL: http://arxiv.org/abs/2308.13418v1
- Date: Fri, 25 Aug 2023 15:03:36 GMT
- Title: Nougat: Neural Optical Understanding for Academic Documents
- Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
- Abstract summary: We propose a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language.
The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age.
- Score: 15.242993369368111
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scientific knowledge is predominantly stored in books and scientific
journals, often in the form of PDFs. However, the PDF format leads to a loss of
semantic information, particularly for mathematical expressions. We propose
Nougat (Neural Optical Understanding for Academic Documents), a Visual
Transformer model that performs an Optical Character Recognition (OCR) task for
processing scientific documents into a markup language, and demonstrate the
effectiveness of our model on a new dataset of scientific documents. The
proposed approach offers a promising solution to enhance the accessibility of
scientific knowledge in the digital age, by bridging the gap between
human-readable documents and machine-readable text. We release the models and
code to accelerate future work on scientific text recognition.
Related papers
- DocReLM: Mastering Document Retrieval with Language Model [49.847369507694154]
We demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities.
Our approach involves training the retriever and reranker using domain-specific data generated by large language models.
We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance.
(2024-05-19T06:30:22Z)
- PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents [4.191058827240492]
We present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records.
We evaluate the efficacy of transformer-based OCR models when trained on this resource.
(2024-03-23T05:20:36Z)
- ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science [0.0]
Large language models record impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
(2023-11-21T02:02:46Z)
- Large Language Models for Scientific Synthesis, Inference and Explanation [56.41963802804953]
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
(2023-10-12T02:17:59Z)
- MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
(2023-05-07T03:29:55Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
(2023-01-24T17:13:08Z)
- Modeling Information Change in Science Communication with Semantically Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change.
SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
(2022-10-24T07:44:38Z)
- Automated Creation and Human-assisted Curation of Computable Scientific Models from Code and Text [2.3746609573239756]
Domain experts cannot gain a complete understanding of the implementation of a scientific model if they are not familiar with the code.
We develop a system for the automated creation and human-assisted curation of scientific models.
We present experimental results obtained using a dataset of code and associated text derived from NASA's Hypersonic Aerodynamics website.
(2022-01-28T17:31:38Z)
- Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks [12.221478896815292]
We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD).
SLLD is a shared subtask of several information extraction problems.
Our results show good improvement with fine-tuning of a pre-trained base network.
(2020-10-18T23:50:28Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
(2020-04-15T16:05:51Z)