Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics
- URL: http://arxiv.org/abs/2411.14877v1
- Date: Fri, 22 Nov 2024 11:59:15 GMT
- Title: Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics
- Authors: Arno Simons
- Abstract summary: The project demonstrates the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science.
The entire training process was conducted using freely available code, pretrained weights, and text inputs, and was completed on a single MacBook Pro laptop.
Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, and was completed on a single MacBook Pro laptop (M2, 96 GB). Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction, as well as related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.
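The workflow described in the abstract amounts to domain-adaptive (continued) masked-language-model pretraining of a general BERT checkpoint, followed by extraction of contextualized word embeddings for target terms. Below is a minimal sketch of what such a setup could look like with the Hugging Face Transformers library; it is not the project's released code, and the corpus file name, starting checkpoint, batch size, and learning rate are illustrative assumptions (only the three training epochs come from the abstract).

```python
# Minimal sketch: continued MLM pretraining of a general BERT model on a
# domain corpus stored as one paragraph per line. File name, batch size, and
# learning rate are assumptions, not Astro-HEP-BERT's reported settings.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "bert-base-uncased"                # general pretrained BERT as the starting point
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)

# Hypothetical corpus file: one paragraph per line.
dataset = load_dataset("text", data_files={"train": "astro_hep_paragraphs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="astro-hep-bert",
    num_train_epochs=3,                   # three epochs, as stated in the abstract
    per_device_train_batch_size=16,       # assumption
    learning_rate=5e-5,                   # assumption
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("astro-hep-bert/final")
tokenizer.save_pretrained("astro-hep-bert/final")
```

A second sketch shows how the adapted model could then be used to extract CWEs for a target term and cluster them as a rough form of word sense induction. The example sentences, the use of the last hidden layer, and the k-means step are assumptions for illustration; in practice one would collect many occurrences of the term from the corpus.

```python
# Minimal sketch: extract CWEs for a target term and cluster the occurrences.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("astro-hep-bert/final")  # path from the sketch above
model = AutoModel.from_pretrained("astro-hep-bert/final")
model.eval()

sentences = [
    "The Planck satellite measured the cosmic microwave background.",
    "Quantum gravity effects become relevant near the Planck scale.",
]

def target_embedding(sentence, target="Planck"):
    """Average the last-hidden-state vectors of the subword tokens covering the target."""
    start = sentence.index(target)
    end = start + len(target)
    enc = tokenizer(sentence, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, hidden_dim)
    # Keep tokens whose character span overlaps the target term.
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0)

embeddings = torch.stack([target_embedding(s) for s in sentences]).numpy()
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)   # each occurrence assigned to a tentative "sense" cluster
```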
Related papers
- Pretraining Language Models for Diachronic Linguistic Change Discovery [8.203894221271302]
We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection.
We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices.
We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus.
arXiv Detail & Related papers (2025-04-07T21:51:32Z)
- Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science [0.0]
Using the term "Planck" as a test case, I evaluate five BERT-based models with varying degrees of domain-specific pretraining.
Results demonstrate that the domain-adapted models outperform the general-purpose ones in disambiguating the target term.
The study underscores the importance of domain-specific pretraining for analyzing scientific language.
arXiv Detail & Related papers (2024-11-21T12:38:23Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)
- PhysBERT: A Text Embedding Model for Physics Scientific Literature [0.0]
In this work, we introduce PhysBERT, the first physics-specific text embedding model.
Pre-trained on a curated corpus of 1.2 million arXiv physics papers and fine-tuned with supervised data, PhysBERT outperforms leading general-purpose models on physics-specific tasks.
arXiv Detail & Related papers (2024-08-18T19:18:12Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- INDUS: Effective and Efficient Language Models for Scientific Applications [8.653859684720231]
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks.
We developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics.
We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on new tasks as well as existing tasks in the domains of interest.
arXiv Detail & Related papers (2024-05-17T12:15:07Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Leveraging Domain Agnostic and Specific Knowledge for Acronym Disambiguation [5.766754189548904]
Acronym disambiguation aims to find the correct meaning of an ambiguous acronym in a text.
We propose a Hierarchical Dual-path BERT method coined hdBERT to capture the general fine-grained and high-level specific representations.
Using the widely adopted SciAD dataset, which contains 62,441 sentences, we investigate the effectiveness of hdBERT.
arXiv Detail & Related papers (2021-07-01T09:10:00Z)
- ELECTRAMed: a new pre-trained language representation model for biomedical NLP [0.0]
We propose a pre-trained domain-specific language model, called ELECTRAMed, suited for the biomedical field.
The novel approach inherits the learning framework of the general-domain ELECTRA architecture, as well as its computational advantages.
arXiv Detail & Related papers (2021-04-19T19:38:34Z)
- Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)
- Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space [109.79957125584252]
Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language.
In this paper, we propose the first large-scale language VAE model, Optimus.
arXiv Detail & Related papers (2020-04-05T06:20:18Z)