Automatic Aspect Extraction from Scientific Texts
- URL: http://arxiv.org/abs/2310.04074v1
- Date: Fri, 6 Oct 2023 07:59:54 GMT
- Title: Automatic Aspect Extraction from Scientific Texts
- Authors: Anna Marshalova, Elena Bruches, Tatiana Batura
- Abstract summary: We present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion.
We show that there are some differences in aspect representation across domains, but even though our model was trained on a limited number of scientific domains, it is still able to generalize to new domains.
- Score: 0.9208007322096533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Being able to extract from scientific papers their main points, key insights,
and other important information, referred to here as aspects, might facilitate
the process of conducting a scientific literature review. Therefore, the aim of
our research is to create a tool for automatic aspect extraction from
Russian-language scientific texts of any domain. In this paper, we present a
cross-domain dataset of scientific texts in Russian, annotated with such
aspects as Task, Contribution, Method, and Conclusion, as well as a baseline
algorithm for aspect extraction, based on the multilingual BERT model
fine-tuned on our data. We show that there are some differences in aspect
representation in different domains, but even though our model was trained on a
limited number of scientific domains, it is still able to generalize to new
domains, as was proved by cross-domain experiments. The code and the dataset
are available at
\url{https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts}.
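As a rough illustration of the kind of baseline the abstract describes, below is a minimal Python sketch of aspect extraction framed as token classification with multilingual BERT, using the HuggingFace transformers library. The BIO label scheme, the bert-base-multilingual-cased checkpoint, and the inference loop are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code and trained weights.

# Minimal sketch: BIO-style token classification for aspect extraction with
# multilingual BERT. Label names and checkpoint are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BIO tag set for the four annotated aspects.
ASPECTS = ["Task", "Contribution", "Method", "Conclusion"]
LABELS = ["O"] + [f"{prefix}-{aspect}" for aspect in ASPECTS for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)
model.eval()

def extract_aspects(sentence: str):
    """Tag each token of a sentence with a predicted aspect label."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Skip special tokens; with an untrained classification head the labels are
    # random, so this only demonstrates the inference loop. In practice the model
    # would first be fine-tuned on the annotated dataset.
    return [(tok, id2label[i]) for tok, i in zip(tokens, pred_ids)
            if tok not in tokenizer.all_special_tokens]

if __name__ == "__main__":
    print(extract_aspects("В работе предложен метод извлечения аспектов из научных текстов."))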
Related papers
- Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian [1.565361244756411]
We explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts.
Experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics.
The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language.
arXiv Detail & Related papers (2024-09-16T18:15:28Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks show a notable gap in evaluating multi-modal information retrieval (MMIR) performance for image-text pairing within the scientific domain.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
- Cross-Domain Robustness of Transformer-based Keyphrase Generation [1.8492669447784602]
A list of keyphrases is an important element of a text in databases and repositories of electronic documents.
In our experiments, abstractive text summarization models fine-tuned for keyphrase generation achieve high results on the corpus they were trained on.
We present an evaluation of fine-tuned BART models for the keyphrase selection task across six benchmark corpora.
arXiv Detail & Related papers (2023-12-17T12:27:15Z)
- MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain [1.209268134212644]
Classifying the Argumentative Zone (AZ) has been proposed to improve processing of scholarly documents.
We present and release a new dataset of 50 manually annotated research articles.
arXiv Detail & Related papers (2023-07-05T14:55:18Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smart phones.
We make quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian [0.0]
This paper is devoted to the study of methods for information extraction from scientific texts on information technology.
Several modifications of methods for the Russian language are proposed.
It also includes the results of experiments comparing a keyword extraction method, a vocabulary-based method, and several methods based on neural networks.
arXiv Detail & Related papers (2020-11-19T13:40:03Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation [4.178929174617172]
We focus on document-level embeddings based on the semantic and relational aspects of articles, using Natural Language Processing (NLP) and Graph Neural Networks (GNNs).
Our results show that using NLP we can encode a semantic space of articles, while with GNNs we are able to build a relational space in which the social practices of a research community are also encoded.
arXiv Detail & Related papers (2020-11-05T14:57:41Z) - Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.