Nougat: Neural Optical Understanding for Academic Documents
- URL: http://arxiv.org/abs/2308.13418v1
- Date: Fri, 25 Aug 2023 15:03:36 GMT
- Title: Nougat: Neural Optical Understanding for Academic Documents
- Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
- Abstract summary: We propose a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language.
The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age.
- Score: 15.242993369368111
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scientific knowledge is predominantly stored in books and scientific
journals, often in the form of PDFs. However, the PDF format leads to a loss of
semantic information, particularly for mathematical expressions. We propose
Nougat (Neural Optical Understanding for Academic Documents), a Visual
Transformer model that performs an Optical Character Recognition (OCR) task for
processing scientific documents into a markup language, and demonstrate the
effectiveness of our model on a new dataset of scientific documents. The
proposed approach offers a promising solution to enhance the accessibility of
scientific knowledge in the digital age, by bridging the gap between
human-readable documents and machine-readable text. We release the models and
code to accelerate future work on scientific text recognition.
Related papers
- The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes [0.0]
We introduce the Discovery Engine, a framework to transform literature into a unified, computationally tractable representation of a scientific domain.<n>The Discovery Engine offers a new paradigm for AI-augmented scientific inquiry and accelerated discovery.
arXiv Detail & Related papers (2025-05-23T05:51:34Z) - SciMantify -- A Hybrid Approach for the Evolving Semantification of Scientific Knowledge [0.4499833362998487]
We propose an evolution model of knowledge representation, inspired by the 5-star Linked Open Data (LOD) model.<n>We develop a hybrid approach, called SciMantify, to support its evolving semantification.<n>We implement the approach in the Open Research Knowledge Graph (ORKG), an established platform for improving the findability, accessibility, interoperability, and reusability of scientific knowledge.
arXiv Detail & Related papers (2025-04-14T07:57:55Z) - Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs [15.610004991273005]
We present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs.
We enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing.
arXiv Detail & Related papers (2024-10-30T22:00:34Z) - UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z) - DocReLM: Mastering Document Retrieval with Language Model [49.847369507694154]
We demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities.
Our approach involves training the retriever and reranker using domain-specific data generated by large language models.
We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance.
arXiv Detail & Related papers (2024-05-19T06:30:22Z) - PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents [4.191058827240492]
We present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records.
We evaluate the efficacy of transformer-based OCR models when trained on this resource.
arXiv Detail & Related papers (2024-03-23T05:20:36Z) - ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for
Interdisciplinary Science [0.0]
Large language models record impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
arXiv Detail & Related papers (2023-11-21T02:02:46Z) - MIReAD: Simple Method for Learning High-quality Representations from
Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z) - The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - Modeling Information Change in Science Communication with Semantically
Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change.
SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z) - Automated Creation and Human-assisted Curation of Computable Scientific
Models from Code and Text [2.3746609573239756]
Domain experts cannot gain a complete understanding of the implementation of a scientific model if they are not familiar with the code.
We develop a system for the automated creation and human-assisted curation of scientific models.
We present experimental results obtained using a dataset of code and associated text derived from NASA's Hypersonic Aerodynamics website.
arXiv Detail & Related papers (2022-01-28T17:31:38Z) - Vision-Based Layout Detection from Scientific Literature using Recurrent
Convolutional Neural Networks [12.221478896815292]
We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD)
SLLD is a shared subtask of several information extraction problems.
Our results show good improvement with fine-tuning of a pre-trained base network.
arXiv Detail & Related papers (2020-10-18T23:50:28Z) - SPECTER: Document-level Representation Learning using Citation-informed
Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.