Building astroBERT, a language model for Astronomy & Astrophysics
- URL: http://arxiv.org/abs/2112.00590v1
- Date: Wed, 1 Dec 2021 16:01:46 GMT
- Title: Building astroBERT, a language model for Astronomy & Astrophysics
- Authors: Felix Grezes, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J.
Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson,
Roman Chyla, Stephen McDonald, Timothy W. Hostetler, Matthew R. Templeton,
Kelly E. Lockhart, Nemanja Martinovic, Shinyi Chen, Chris Tanner, Pavlos
Protopapas
- Abstract summary: We are applying modern machine learning and natural language processing techniques to the NASA Astrophysics Data System (ADS) dataset.
We are training astroBERT, a deeply contextual language model based on research at Google.
Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool.
- Score: 1.4587241287997816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The existing search tools for exploring the NASA Astrophysics Data System
(ADS) can be quite rich and empowering (e.g., similar and trending operators),
but researchers are not yet allowed to fully leverage semantic search. For
example, a query for "results from the Planck mission" should be able to
distinguish between all the various meanings of Planck (person, mission,
constant, institutions and more) without further clarification from the user.
At ADS, we are applying modern machine learning and natural language processing
techniques to our dataset of recent astronomy publications to train astroBERT,
a deeply contextual language model based on research at Google. Using
astroBERT, we aim to enrich the ADS dataset and improve its discoverability,
and in particular we are developing our own named entity recognition tool. We
present here our preliminary results and lessons learned.
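As a rough, illustrative sketch only (not the authors' released code), the snippet below shows how a BERT-style checkpoint could be queried for named entity recognition through the Hugging Face transformers pipeline. The checkpoint name is a generic placeholder rather than the actual astroBERT weights, and the entity labels depend entirely on how the token-classification head was fine-tuned.

# Hypothetical sketch: running a BERT-style token-classification model as an
# NER tagger with Hugging Face transformers. The checkpoint below is a
# placeholder, not the released astroBERT weights.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_NAME = "bert-base-uncased"  # placeholder; a domain-adapted checkpoint would go here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

# Aggregate word pieces back into whole entity spans.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

query = "Results from the Planck mission constrain the value of the Hubble constant."
for span in ner(query):
    # Each span carries a predicted label, a confidence score, and the matched text.
    print(span["entity_group"], round(span["score"], 3), span["word"])

With an astronomy-tuned head, the goal would be for "Planck" in a query like the one above to resolve to a mission label rather than a person, a constant, or an institution.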
Related papers
- INDUS: Effective and Efficient Language Models for Scientific Applications [8.76933154920986]
Large language models (LLMs) trained on general domain corpora have shown remarkable results on natural language processing (NLP) tasks.
Previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks.
We developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains.
arXiv Detail & Related papers (2024-05-17T12:15:07Z)
- GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM into geoscience by further pre-training the model with a vast amount of geoscience texts and by supervised fine-tuning (SFT) the resulting model with our custom-collected instruction-tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, the largest geoscience-specific text corpus to date.
Then we fine-tune the model with 1 million pairs of instruction-tuning data.
arXiv Detail & Related papers (2023-12-31T09:22:54Z)
- Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery [64.41455104593304]
Reinforcement Learning from Human Feedback (RLHF) can improve machine learning models and align them with human preferences.
We propose to adapt similar RL-based methods to unsupervised object discovery.
We demonstrate that our approach is not only more accurate, but also orders of magnitude faster to train.
arXiv Detail & Related papers (2023-10-29T17:03:12Z)
- Large Language Models for Scientific Synthesis, Inference and Explanation [56.41963802804953]
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment its knowledge by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
arXiv Detail & Related papers (2023-10-12T02:17:59Z)
- Radio astronomical images object detection and segmentation: A benchmark on deep learning methods [5.058069142315917]
In this work, we explore the performance of the most established deep learning approaches, applied to astronomical images obtained by radio interferometric instrumentation, to solve the task of automatic source detection.
The goal is to provide an overview of existing techniques, in terms of prediction performance and computational efficiency, to scientists in the astrophysics community who would like to employ machine learning in their research.
arXiv Detail & Related papers (2023-03-08T10:55:24Z)
- Applications of AI in Astronomy [0.0]
We provide an overview of the use of Machine Learning (ML) and other AI methods in astronomy, astrophysics, and cosmology.
Over the past decade we have seen an exponential growth of the astronomical literature involving a variety of ML/AI applications.
As the data complexity continues to increase, we anticipate further advances leading towards a collaborative human-AI discovery.
arXiv Detail & Related papers (2022-12-03T00:38:59Z)
- Improving astroBERT using Semantic Textual Similarity [0.785116730789274]
We introduce astroBERT, a machine learning language model tailored to the text used in astronomy papers in NASA's Astrophysics Data System (ADS).
We show how astroBERT improves over existing public language models on astrophysics specific tasks.
We detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context to further improve astroBERT.
arXiv Detail & Related papers (2022-11-29T16:15:32Z)
- Elements of effective machine learning datasets in astronomy [1.552171919003135]
We identify elements of effective machine learning datasets in astronomy.
We discuss why these elements are important for astronomical applications and ways to put them in practice.
arXiv Detail & Related papers (2022-11-25T23:37:24Z)
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-supervised language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
arXiv Detail & Related papers (2021-04-12T23:14:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.