Complex Mathematical Symbol Definition Structures: A Dataset and Model
for Coordination Resolution in Definition Extraction
- URL: http://arxiv.org/abs/2305.14660v1
- Date: Wed, 24 May 2023 02:53:48 GMT
- Authors: Anna Martin-Boyle, Andrew Head, Kyle Lo, Risham Sidhu, Marti A.
Hearst, and Dongyeop Kang
- Abstract summary: We present SymDef, an English language dataset of 5,927 sentences from full-text scientific papers.
This dataset focuses specifically on complex coordination structures such as "respectively" constructions.
We introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mathematical symbol definition extraction is important for improving
scholarly reading interfaces and scholarly information extraction (IE).
However, the task poses several challenges: math symbols are difficult to
process as they are not composed of natural language morphemes; and scholarly
papers often contain sentences that require resolving complex coordinate
structures. We present SymDef, an English language dataset of 5,927 sentences
from full-text scientific papers where each sentence is annotated with all
mathematical symbols linked with their corresponding definitions. This dataset
focuses specifically on complex coordination structures such as "respectively"
constructions, which often contain overlapping definition spans. We also
introduce a new definition extraction method that masks mathematical symbols,
creates a copy of each sentence for each symbol, specifies a target symbol, and
predicts its corresponding definition spans using slot filling. Our experiments
show that our definition extraction model significantly outperforms RoBERTa and
other strong IE baseline systems by 10.9 points with a macro F1 score of 84.82.
With our dataset and model, we can detect complex definitions in scholarly
documents to make scientific writing more readable.
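The input preparation described in the abstract (mask the symbols, copy the sentence once per symbol, mark one target per copy) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `$...$` delimiter convention, the `SYMBOL` and `TARGET_SYMBOL` placeholder strings, and the function name `make_target_copies` are all assumptions made for the example. The slot-filling model that predicts definition spans is not shown.

```python
import re
from typing import List, Tuple

# Assumption: math symbols appear in the sentence delimited by $...$.
MATH_PATTERN = re.compile(r"\$[^$]+\$")
# Hypothetical placeholder strings; the paper's actual tokens may differ.
MASK = "SYMBOL"
TARGET = "TARGET_SYMBOL"


def make_target_copies(sentence: str) -> List[Tuple[str, str]]:
    """Return one (target_symbol, masked_sentence) pair per symbol occurrence.

    In each copy, the target symbol is replaced by TARGET and every other
    symbol is replaced by MASK.
    """
    matches = list(MATH_PATTERN.finditer(sentence))
    copies = []
    for target_index, target_match in enumerate(matches):
        pieces = []
        cursor = 0
        for index, match in enumerate(matches):
            # Keep the plain text between the previous symbol and this one.
            pieces.append(sentence[cursor:match.start()])
            pieces.append(TARGET if index == target_index else MASK)
            cursor = match.end()
        pieces.append(sentence[cursor:])
        copies.append((target_match.group(0), "".join(pieces)))
    return copies


example = "Let $h$ and $w$ denote the height and width, respectively."
for symbol, masked in make_target_copies(example):
    print(symbol, "->", masked)
```

Each copy would then go to a sequence labeler that tags the definition span for the marked target only. Under that setup, overlapping spans in "respectively" constructions no longer need to be predicted in a single pass.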
Related papers
- STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing [2.2315518704035595]
We introduce STEM-PoM, a benchmark dataset to evaluate large language models' reasoning abilities on math symbols.
The dataset contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors.
Our experiments show that state-of-the-art LLMs achieve an average of 20-60% accuracy under in-context learning and 50-60% accuracy with fine-tuning.
arXiv Detail & Related papers (2024-11-01T06:25:06Z)
- PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer [51.260384040953326]
Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition.
PosFormer consistently outperforms state-of-the-art methods, with gains of 2.03%/1.22%/2, 1.83%, and 4.62% on datasets.
arXiv Detail & Related papers (2024-07-10T15:42:58Z)
- Measuring Annotator Agreement Generally across Complex Structured,
Multi-object, and Free-text Annotation Tasks [79.24863171717972]
Inter-annotator agreement (IAA) is a key metric for quality assurance.
Measures exist for simple categorical and ordinal labeling tasks, but little work has considered more complex labeling tasks.
Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability.
arXiv Detail & Related papers (2022-12-15T20:12:48Z)
- Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- COMPILING: A Benchmark Dataset for Chinese Complexity Controllable
Definition Generation [2.935516292500541]
This paper proposes a novel task of generating definitions for a word with controllable complexity levels.
We introduce COMPILING, a dataset that provides detailed information about Chinese definitions, with each definition labeled with its complexity level.
arXiv Detail & Related papers (2022-09-29T08:17:53Z)
- Symlink: A New Dataset for Scientific Symbol-Description Linking [69.97278287534157]
We present a new large-scale dataset that emphasizes extracting symbols and descriptions in scientific documents.
Our experiments on Symlink demonstrate the challenges of the symbol-description linking task for existing models.
arXiv Detail & Related papers (2022-04-26T04:36:14Z)
- Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus.
We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Automated Discovery of Mathematical Definitions in Text with Deep Neural
Networks [6.172021438837204]
This paper focuses on automatic detection of one-sentence definitions in mathematical texts.
We apply deep learning methods such as the Convolutional Neural Network (CNN) and the Long Short-Term Memory network (LSTM).
We also present a new dataset for definition extraction from mathematical texts.
arXiv Detail & Related papers (2020-11-09T15:57:53Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert
Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
- ÆTHEL: Automatically Extracted Typelogical Derivations for Dutch [0.8379286663107844]
AETHEL is a semantic compositionality dataset for written Dutch.
AETHEL's types and derivations are obtained by means of an extraction algorithm applied to the syntactic analyses of LASSY Small.
arXiv Detail & Related papers (2019-12-29T11:31:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.