Entity Recognition and Relation Extraction from Scientific and Technical
Texts in Russian
- URL: http://arxiv.org/abs/2011.09817v3
- Date: Sat, 26 Dec 2020 08:21:42 GMT
- Title: Entity Recognition and Relation Extraction from Scientific and Technical
Texts in Russian
- Authors: Elena Bruches, Alexey Pauls, Tatiana Batura, Vladimir Isachenko
- Abstract summary: This paper is devoted to the study of methods for information extraction from scientific texts on information technology.
Several modifications of methods for the Russian language are proposed.
It also includes the results of experiments comparing a keyword extraction method, vocabulary method, and some methods based on neural networks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is devoted to the study of methods for information extraction
(entity recognition and relation classification) from scientific texts on
information technology. Scientific publications provide valuable insight
into cutting-edge scientific advances, but efficient processing of increasing
amounts of data is a time-consuming task. In this paper, several modifications
of methods for the Russian language are proposed. It also includes the results
of experiments comparing a keyword extraction method, vocabulary method, and
some methods based on neural networks. Text collections for these tasks exist
for the English language and are actively used by the scientific community, but
at present, such datasets in Russian are not publicly available. In this paper,
we present a corpus of scientific texts in Russian, RuSERRC. This dataset
consists of 1600 unlabeled documents and 80 labeled with entities and semantic
relations (6 relation types were considered). The dataset and models are
available at https://github.com/iis-research-team. We hope they can be useful
for research purposes and development of information extraction systems.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers [0.20482269513546458]
The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization.
A feature of the dataset is its multimodal data, which includes texts, tables and figures.
arXiv Detail & Related papers (2024-05-13T16:21:33Z)
- Automatic Aspect Extraction from Scientific Texts [0.9208007322096533]
We present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion.
We show that there are some differences in aspect representation across domains, but even though our model was trained on a limited number of scientific domains, it is still able to generalize to new domains.
arXiv Detail & Related papers (2023-10-06T07:59:54Z)
- Uzbek text summarization based on TF-IDF [0.0]
This article presents an experiment on the summarization task for the Uzbek language.
The methodology relies on text abstracting with the TF-IDF algorithm.
We summarize the given text by applying the n-gram method to important parts of the whole text.
arXiv Detail & Related papers (2023-03-01T12:39:46Z)
- TERMinator: A system for scientific texts processing [0.0]
This paper is devoted to the extraction of entities and semantic relations between them from scientific texts.
We present a dataset that includes annotations for two tasks and develop a system called TERMinator for the study of the influence of language models on term recognition.
arXiv Detail & Related papers (2022-09-29T15:14:42Z)
- A system for information extraction from scientific texts in Russian [0.0]
The system performs several tasks in an end-to-end manner: term recognition, extraction of relations between terms, and term linking with entities from the knowledge base.
The advantage of the implemented methods is that the system does not require a large amount of labeled data, which saves time and effort for data labeling.
The source code is publicly available and can be used for different research purposes.
arXiv Detail & Related papers (2021-09-14T14:08:37Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.