Building astroBERT, a language model for Astronomy & Astrophysics
- URL: http://arxiv.org/abs/2112.00590v1
- Date: Wed, 1 Dec 2021 16:01:46 GMT
- Title: Building astroBERT, a language model for Astronomy & Astrophysics
- Authors: Felix Grezes, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J.
Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson,
Roman Chyla, Stephen McDonald, Timothy W. Hostetler, Matthew R. Templeton,
Kelly E. Lockhart, Nemanja Martinovic, Shinyi Chen, Chris Tanner, Pavlos
Protopapas
- Abstract summary: We are applying modern machine learning and natural language processing techniques to the NASA Astrophysics Data System (ADS) dataset.
We are training astroBERT, a deeply contextual language model based on research at Google.
Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The existing search tools for exploring the NASA Astrophysics Data System
(ADS) can be quite rich and empowering (e.g., similar and trending operators),
but they do not yet allow researchers to fully leverage semantic search. For
example, a query for "results from the Planck mission" should be able to
distinguish between all the various meanings of Planck (person, mission,
constant, institutions and more) without further clarification from the user.
At ADS, we are applying modern machine learning and natural language processing
techniques to our dataset of recent astronomy publications to train astroBERT,
a deeply contextual language model based on research at Google. Using
astroBERT, we aim to enrich the ADS dataset and improve its discoverability,
and in particular we are developing our own named entity recognition tool. We
present here our preliminary results and lessons learned.
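As a rough illustration of the named entity recognition use case described in the abstract, the sketch below loads a BERT-style checkpoint for token classification with the Hugging Face transformers library. The checkpoint name adsabs/astroBERT, the label set, and the example sentence are illustrative assumptions only; they are not necessarily the model, labels, or tooling used by the ADS team.

```python
# Minimal sketch (not the ADS implementation): named entity recognition framed as
# token classification with a BERT-style encoder via Hugging Face transformers.
# The checkpoint name "adsabs/astroBERT" and the label set below are illustrative
# assumptions; the classification head is randomly initialized here and would need
# fine-tuning on labelled astronomy text before its predictions mean anything.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Person", "I-Person", "B-Mission", "I-Mission"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained("adsabs/astroBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "adsabs/astroBERT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

text = "Results from the Planck mission constrain the value of the Hubble constant."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# Print the predicted label for each sub-word token.
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted_ids):
    print(f"{token:15s} {labels[int(label_id)]}")
```

A classification head attached this way starts from random weights, so in practice it would be fine-tuned on labelled astronomy text (for example, sentences annotated with person, mission, and facility spans) before the per-token labels become meaningful.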
Related papers
- OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence (arXiv, 2025-03-20)
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence.
We propose an MLLM (OmniGeo) tailored to geospatial applications.
By combining the strengths of natural language understanding and spatial reasoning, our model improves instruction following and the accuracy of GeoAI systems.
- Large Language Models: New Opportunities for Access to Science (arXiv, 2025-01-13)
The uptake of Retrieval Augmented Generation-enhanced chat applications in the construction of the open science environment of the KM3NeT neutrino detectors serves as a focus point to explore and exemplify prospects for the wider application of Large Language Models in our science.
- Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community (arXiv, 2024-08-17)
We propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task.
We conduct experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark.
Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
- pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy (arXiv, 2024-08-02)
Pathfinder is a machine learning framework designed to enable literature review and knowledge discovery in astronomy.
Our framework couples advanced retrieval techniques with LLM-based synthesis to search the astronomical literature by semantic context.
It addresses the complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes.
- SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models (arXiv, 2024-01-15)
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, namely ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
- GeoGalactica: A Scientific Large Language Model in Geoscience (arXiv, 2023-12-31)
Large language models (LLMs) have achieved huge success thanks to their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset.
We train GeoGalactica on a geoscience-related text corpus containing 65 billion tokens, the largest geoscience-specific text corpus to date.
We then fine-tune the model with 1 million pairs of instruction-tuning data.
- Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery (arXiv, 2023-10-29)
Reinforcement Learning from Human Feedback (RLHF) can improve machine learning models and align them with human preferences.
We propose to adapt similar RL-based methods to unsupervised object discovery.
We demonstrate that our approach is not only more accurate, but also orders of magnitude faster to train.
- Large Language Models for Scientific Synthesis, Inference and Explanation (arXiv, 2023-10-12)
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
- Radio astronomical images object detection and segmentation: A benchmark on deep learning methods (arXiv, 2023-03-08)
In this work, we explore the performance of the most established deep learning approaches, applied to astronomical images obtained by radio interferometric instrumentation, to solve the task of automatic source detection.
The goal is to provide an overview of existing techniques, in terms of prediction performance and computational efficiency, to scientists in the astrophysics community who would like to employ machine learning in their research.
- Applications of AI in Astronomy (arXiv, 2022-12-03)
We provide an overview of the use of Machine Learning (ML) and other AI methods in astronomy, astrophysics, and cosmology.
Over the past decade we have seen an exponential growth of the astronomical literature involving a variety of ML/AI applications.
As data complexity continues to increase, we anticipate further advances leading towards collaborative human-AI discovery.
- Improving astroBERT using Semantic Textual Similarity (arXiv, 2022-11-29)
We introduce astroBERT, a machine learning language model tailored to the text used in astronomy papers in NASA's Astrophysics Data System (ADS).
We show how astroBERT improves over existing public language models on astrophysics-specific tasks.
We detail how ADS plans to harness the unique structure of scientific papers, the citation graph, and citation context to further improve astroBERT.
- Elements of effective machine learning datasets in astronomy (arXiv, 2022-11-25)
We identify elements of effective machine learning datasets in astronomy.
We discuss why these elements are important for astronomical applications and ways to put them into practice.
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models (arXiv, 2022-11-21)
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.