Related papers: Nougat: Neural Optical Understanding for Academic Documents

URL: http://arxiv.org/abs/2308.13418v1
Date: Fri, 25 Aug 2023 15:03:36 GMT
Title: Nougat: Neural Optical Understanding for Academic Documents
Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
Abstract summary: We propose a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age.
Score: 15.242993369368111
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

Related papers

Advancing Scientific Knowledge Retrieval and Reuse with a Novel Digital Library for Machine-Readable Knowledge [4.450387519903374]
ORKG reborn is an emerging digital library that supports finding, accessing, and reusing accurate, fine-grained, and reproducible machine-readable expressions of scientific knowledge. We describe the proposed system and demonstrate its practical viability and potential for information retrieval in contrast to state-of-the-art digital libraries and document-centric scholarly communication.
arXiv Detail & Related papers (2025-11-11T17:20:02Z)
The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes [0.0]
We introduce the Discovery Engine, a framework to transform literature into a unified, computationally tractable representation of a scientific domain. The Discovery Engine offers a new paradigm for AI-augmented scientific inquiry and accelerated discovery.
arXiv Detail & Related papers (2025-05-23T05:51:34Z)
SciMantify -- A Hybrid Approach for the Evolving Semantification of Scientific Knowledge [0.4499833362998487]
We propose an evolution model of knowledge representation, inspired by the 5-star Linked Open Data (LOD) model. We develop a hybrid approach, called SciMantify, to support its evolving semantification. We implement the approach in the Open Research Knowledge Graph (ORKG), an established platform for improving the findability, accessibility, interoperability, and reusability of scientific knowledge.
arXiv Detail & Related papers (2025-04-14T07:57:55Z)
Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs [15.610004991273005]
We present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. We enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing.
arXiv Detail & Related papers (2024-10-30T22:00:34Z)
UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model. We show that UNIT significantly outperforms existing methods on document-related tasks. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
DocReLM: Mastering Document Retrieval with Language Model [49.847369507694154]
We demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance.
arXiv Detail & Related papers (2024-05-19T06:30:22Z)
PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents [4.191058827240492]
We present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records. We evaluate the efficacy of transformer-based OCR models when trained on this resource.
arXiv Detail & Related papers (2024-03-23T05:20:36Z)
ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science [0.0]
Large language models record impressive performance on many natural language processing tasks. Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources. We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
arXiv Detail & Related papers (2023-11-21T02:02:46Z)
MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers. We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
Modeling Information Change in Science Communication with Semantically Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z)
Automated Creation and Human-assisted Curation of Computable Scientific Models from Code and Text [2.3746609573239756]
Domain experts cannot gain a complete understanding of the implementation of a scientific model if they are not familiar with the code. We develop a system for the automated creation and human-assisted curation of scientific models. We present experimental results obtained using a dataset of code and associated text derived from NASA's Hypersonic Aerodynamics website.
arXiv Detail & Related papers (2022-01-28T17:31:38Z)
Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks [12.221478896815292]
We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD). SLLD is a shared subtask of several information extraction problems. Our results show good improvement with fine-tuning of a pre-trained base network.
arXiv Detail & Related papers (2020-10-18T23:50:28Z)
SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents based on pretraining a Transformer language model. We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.