MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text
- URL: http://arxiv.org/abs/2212.07156v1
- Date: Wed, 14 Dec 2022 11:10:03 GMT
- Title: MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text
- Authors: Sophie Henning, Nicole Macher, Stefan Grünewald, Annemarie Friedrich
- Abstract summary: We introduce the MIST dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
We systematically evaluate a set of competitive neural architectures on MIST.
Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs.
- Score: 1.8502316793903635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modal verbs (e.g., "can", "should", or "must") occur highly frequently in
scientific articles. Decoding their function is not straightforward: they are
often used for hedging, but they may also denote abilities and restrictions.
Understanding their meaning is important for various NLP tasks such as writing
assistance or accurate information extraction from scientific text.
To foster research on the usage of modals in this genre, we introduce the
MIST (Modals In Scientific Text) dataset, which contains 3737 modal instances
in five scientific domains annotated for their semantic, pragmatic, or
rhetorical function. We systematically evaluate a set of competitive neural
architectures on MIST. Transfer experiments reveal that leveraging
non-scientific data is of limited benefit for modeling the distinctions in
MIST. Our corpus analysis provides evidence that scientific communities differ
in their usage of modal verbs, yet classifiers trained on scientific data
generalize to some extent to unseen scientific domains.
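The abstract does not specify the neural architectures evaluated on MIST, so the following is only a minimal sketch of how modal-function classification over MIST-style instances could be framed: mark the target modal verb in its sentence and fine-tune a pre-trained encoder with a small classification head. The label names, the `mark_modal` helper, and the choice of `bert-base-uncased` are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: classify the function of a marked modal verb with a
# pre-trained encoder plus a linear head. Label names are illustrative and
# do NOT reproduce the MIST annotation scheme.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

LABELS = ["hedging", "capability", "obligation", "other"]  # assumed, not from MIST

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")


class ModalFunctionClassifier(nn.Module):
    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state  # (batch, seq_len, dim)
        sentence_repr = hidden[:, 0]                        # [CLS] vector
        return self.head(sentence_repr)                     # logits over labels


def mark_modal(sentence: str, start: int, end: int) -> str:
    """Wrap the target modal verb (a character span) in brackets so the encoder
    can tell which instance in the sentence is being classified."""
    return f"{sentence[:start]}[ {sentence[start:end]} ]{sentence[end:]}"


model = ModalFunctionClassifier(encoder, num_labels=len(LABELS))
text = mark_modal("These results may indicate a systematic bias.", 14, 17)  # "may"
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch)
print(LABELS[logits.argmax(dim=-1).item()])  # head is untrained, so the prediction is arbitrary
```

In an actual setup, the encoder and head would be fine-tuned jointly with cross-entropy loss over the gold MIST labels, and evaluation across the five scientific domains would mirror the paper's transfer experiments.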
Related papers
- Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding [0.0]
This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains.
We employ pre-trained models and fine-tune them on datasets in the scientific domain.
arXiv Detail & Related papers (2024-08-04T01:32:09Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs).
We show that domain shift degrades the performance of scientific NLI models, which demonstrates the diverse characteristics of the different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z)
- SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
- Large Language Models for Scientific Synthesis, Inference and Explanation [56.41963802804953]
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
arXiv Detail & Related papers (2023-10-12T02:17:59Z)
- Domain-specific ChatBots for Science using Embeddings [0.5687661359570725]
Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks.
Here, we demonstrate how existing methods and software tools can be easily combined to yield domain-specific chatbots.
arXiv Detail & Related papers (2023-06-15T15:26:20Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Context Matters: A Strategy to Pre-train Language Model for Science Education [4.053049694533914]
BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks.
The language used by students is different from the language in journals and Wikipedia, which are training sources of BERT.
Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain.
arXiv Detail & Related papers (2023-01-27T23:50:16Z)
- Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation [0.0]
We show how LSI can impute embeddings for domain-specific words from up-to-date knowledge graphs.
We show that LSI can produce reliable embedding vectors for rare and OOV terms in the biomedical domain.
arXiv Detail & Related papers (2022-10-27T12:15:26Z)
- Semantic maps and metrics for science using deep transformer encoders [1.599072005190786]
Recent advances in natural language understanding driven by deep transformer networks offer new possibilities for mapping science.
Transformer embedding models capture shades of association and connotation that vary across different linguistic contexts.
We report a procedure for encoding scientific documents with these tools, measuring their improvement over static word embeddings.
arXiv Detail & Related papers (2021-04-13T04:12:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.