MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain
- URL: http://arxiv.org/abs/2307.02340v1
- Date: Wed, 5 Jul 2023 14:55:18 GMT
- Title: MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain
- Authors: Timo Pierre Schrader, Teresa Bürkle, Sophie Henning, Sherry Tan,
Matteo Finco, Stefan Grünewald, Maira Indrikova, Felix Hildebrand,
Annemarie Friedrich
- Abstract summary: Classifying the Argumentative Zone (AZ) has been proposed to improve processing of scholarly documents.
We present and release a new dataset of 50 manually annotated research articles.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scientific publications follow conventionalized rhetorical structures.
Classifying the Argumentative Zone (AZ), e.g., identifying whether a sentence
states a Motivation, a Result or Background information, has been proposed to
improve processing of scholarly documents. In this work, we adapt and extend
this idea to the domain of materials science research. We present and release a
new dataset of 50 manually annotated research articles. The dataset spans seven
sub-topics and is annotated with a materials-science focused multi-label
annotation scheme for AZ. We detail corpus statistics and demonstrate high
inter-annotator agreement. Our computational experiments show that using
domain-specific pre-trained transformer-based text encoders is key to high
classification performance. We also find that AZ categories from existing
datasets in other domains are transferable to varying degrees.
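The abstract describes AZ classification as a multi-label task: one sentence may belong to several zones at once (e.g., both Background and Motivation). A minimal sketch of the multi-label decision rule, assuming an encoder has already produced per-label logits for a sentence (the label names and scores here are illustrative, not the actual MuLMS-AZ scheme):

```python
import numpy as np

# Hypothetical AZ label set for illustration; the MuLMS-AZ
# annotation scheme is materials-science specific and differs in detail.
AZ_LABELS = ["Motivation", "Background", "Method", "Result", "Conclusion"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_zones(logits, threshold=0.5):
    """Multi-label decision: every zone whose sigmoid score clears the
    threshold is assigned, so a sentence can carry several zones at once."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [label for label, p in zip(AZ_LABELS, probs) if p >= threshold]

# Example logits, as if produced by a fine-tuned domain-specific encoder head.
logits = [-2.1, 1.3, -0.4, 2.7, -1.8]
print(predict_zones(logits))  # ['Background', 'Result']
```

Using independent sigmoids instead of a softmax is what makes the scheme multi-label: each zone is decided on its own, rather than forcing the labels to compete for a single slot.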
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Classification and Clustering of Sentence-Level Embeddings of Scientific Articles Generated by Contrastive Learning [1.104960878651584]
Our approach consists of fine-tuning transformer language models to generate sentence-level embeddings from scientific articles.
We trained our models on three datasets with contrastive learning.
We show that fine-tuning sentence transformers with contrastive learning and using the generated embeddings in downstream tasks is a feasible approach to sentence classification in scientific articles.
arXiv Detail & Related papers (2024-03-30T02:52:14Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
- MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain [0.7947524927438001]
We present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science.
We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.
arXiv Detail & Related papers (2023-10-24T07:23:46Z)
- Automatic Aspect Extraction from Scientific Texts [0.9208007322096533]
We present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion.
We show that there are some differences in aspect representation across domains, and that although our model was trained on a limited number of scientific domains, it is still able to generalize to new ones.
arXiv Detail & Related papers (2023-10-06T07:59:54Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- Pretrained Transformers for Text Ranking: BERT and Beyond [53.83210899683987]
This survey provides an overview of text ranking with neural network architectures known as transformers.
The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing.
arXiv Detail & Related papers (2020-10-13T15:20:32Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high likelihood of gaining citations from those with a low likelihood.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.