Improving astroBERT using Semantic Textual Similarity
- URL: http://arxiv.org/abs/2212.00744v1
- Date: Tue, 29 Nov 2022 16:15:32 GMT
- Title: Improving astroBERT using Semantic Textual Similarity
- Authors: Felix Grezes, Thomas Allen, Sergi Blanco-Cuaresma, Alberto Accomazzi,
Michael J. Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna
M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Kelly E. Lockhart,
Shinyi Chen, Jennifer Koch, Taylor Jacovich, and Pavlos Protopapas
- Abstract summary: We introduce astroBERT, a machine learning language model tailored to the text used in astronomy papers in NASA's Astrophysics Data System (ADS)
We show how astroBERT improves over existing public language models on astrophysics specific tasks.
We detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context to further improve astroBERT.
- Score: 0.785116730789274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The NASA Astrophysics Data System (ADS) is an essential tool for researchers
that allows them to explore the astronomy and astrophysics scientific
literature, but it has yet to exploit recent advances in natural language
processing. At ADASS 2021, we introduced astroBERT, a machine learning language
model tailored to the text used in astronomy papers in ADS. In this work we:
- announce the first public release of the astroBERT language model;
- show how astroBERT improves over existing public language models on
astrophysics specific tasks;
- and detail how ADS plans to harness the unique structure of scientific
papers, the citation graph and citation context, to further improve astroBERT.
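The "Semantic Textual Similarity" of the title refers to scoring how close two sentences are in meaning, conventionally computed as the cosine similarity between their embedding vectors. A minimal sketch of that scoring step, using short hypothetical placeholder vectors rather than actual astroBERT embeddings (which are much higher-dimensional):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sentence embeddings for illustration only.
emb_a = [0.8, 0.1, 0.3]  # e.g. embedding of "dark matter halo mass function"
emb_b = [0.7, 0.2, 0.4]  # e.g. embedding of "halo mass distribution of dark matter"

score = cosine_similarity(emb_a, emb_b)  # near 1.0 for semantically similar sentences
```

Scores close to 1 indicate semantically similar sentences, and STS training objectives push a model's embeddings so that paraphrase pairs score high under exactly this measure.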
Related papers
- AstroLLaVA: towards the unification of astronomical data and natural language [0.0]
We present AstroLLaVA, a vision language model for astronomy that enables interaction with astronomical imagery through natural dialogue.
Our two-stage fine-tuning process adapts the model to both image captioning and visual question answering in the astronomy domain.
We demonstrate AstroLLaVA's performance on an astronomical visual question answering benchmark and release the model weights, code, and training set to encourage further open source work.
arXiv Detail & Related papers (2025-04-11T14:36:31Z)
- Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics [0.0]
The project demonstrates the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science.
The entire training process was conducted using freely available code, pretrained weights, and text inputs, completed on a single MacBook Pro Laptop.
Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets.
arXiv Detail & Related papers (2024-11-22T11:59:15Z)
- Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy [0.0]
We show that ChatGPT uses certain words more often than humans when generating academic text, and we search a total of 1 million articles for these words.
We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024.
These results suggest a widespread adoption of these models in the writing of astronomy papers.
arXiv Detail & Related papers (2024-06-25T07:15:10Z)
- SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
- AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets [7.53209156977206]
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training.
We achieve notable improvements in specialized topic comprehension using a curated set of astronomy corpora.
We present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use.
arXiv Detail & Related papers (2024-01-03T04:47:02Z)
- GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM into geoscience, by further pre-training the model with a vast amount of texts in geoscience, as well as supervised fine-tuning (SFT) the resulting model with our custom collected instruction tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, which is preserved as the largest geoscience-specific text corpus.
Then we fine-tune the model with 1 million pairs of instruction-tuning data.
arXiv Detail & Related papers (2023-12-31T09:22:54Z)
- Large Language Models for Scientific Synthesis, Inference and Explanation [56.41963802804953]
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
arXiv Detail & Related papers (2023-10-12T02:17:59Z)
- AstroLLaMA: Towards Specialized Foundation Models in Astronomy [1.1694367694169385]
We introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv.
Our model generates more insightful and scientifically relevant text completions and embedding extractions than state-of-the-art foundation models.
Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
arXiv Detail & Related papers (2023-09-12T11:02:27Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Building astroBERT, a language model for Astronomy & Astrophysics [1.4587241287997816]
We are applying modern machine learning and natural language processing techniques to the NASA Astrophysics Data System (ADS) dataset.
We are training astroBERT, a deeply contextual language model based on research at Google.
Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool.
arXiv Detail & Related papers (2021-12-01T16:01:46Z)
- First Full-Event Reconstruction from Imaging Atmospheric Cherenkov Telescope Real Data with Deep Learning [55.41644538483948]
The Cherenkov Telescope Array is the future of ground-based gamma-ray astronomy.
Its first prototype telescope built on-site, the Large Size Telescope 1, is currently under commissioning and taking its first scientific data.
We present for the first time the development of a full-event reconstruction based on deep convolutional neural networks and its application to real data.
arXiv Detail & Related papers (2021-05-31T12:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.