SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts
- URL: http://arxiv.org/abs/2110.01295v1
- Date: Mon, 4 Oct 2021 10:00:22 GMT
- Title: SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts
- Authors: Ruben Kruiper, Ioannis Konstas, Alasdair Gray, Farhad Sadeghineko,
Richard Watson and Bimal Kumar
- Abstract summary: This study introduces a shallow parsing task for which training data is relatively cheap to create.
We show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents.
- Score: 6.656036869700669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated Compliance Checking (ACC) systems aim to semantically parse
building regulations to a set of rules. However, semantic parsing is known to
be hard and requires large amounts of training data. The complexity of creating
such training data has led to research that focuses on small sub-tasks, such as
shallow parsing or the extraction of a limited subset of rules. This study
introduces a shallow parsing task for which training data is relatively cheap
to create, with the aim of learning a lexicon for ACC. We annotate a small
domain-specific dataset of 200 sentences, SPaR.txt, and train a sequence tagger
that achieves a 79.93 F1-score on the test set. We then show through manual
evaluation that the model identifies most (89.84%) defined terms in a set of
building regulation documents, and that both contiguous and discontiguous
Multi-Word Expressions (MWE) are discovered with reasonable accuracy (70.3%).
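To make the shallow parsing task concrete, the sketch below decodes BIO-style tags into (multi-word) terms. The example sentence, tag names, and decoding logic are illustrative assumptions rather than the authors' exact scheme, which also covers discontiguous MWEs; the paper's tagger is a trained neural sequence model, not rule-based code.

```python
# Hypothetical example: decoding BIO-style tags into multi-word terms.
tokens = ["external", "fire", "spread", "is", "limited", "by", "cavity", "barriers"]
tags   = ["B-TERM", "I-TERM", "I-TERM", "O", "O", "O", "B-TERM", "I-TERM"]

def decode_terms(tokens, tags):
    """Collect contiguous B-/I- spans into candidate lexicon entries."""
    terms, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B"):
            if current:
                terms.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I") and current:
            current.append(tok)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

print(decode_terms(tokens, tags))
# ['external fire spread', 'cavity barriers']
```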
Related papers
- Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language [4.5224851085910585]
Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages.
This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in the low-resource, domain-specific German language.
arXiv Detail & Related papers (2024-12-13T09:47:26Z)
- The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
Alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z)
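As a toy illustration of summary-source alignment, the sketch below greedily aligns each summary sentence to the source sentence with the highest token overlap. The paper operates at the finer proposition-span level with stronger alignment methods, so this only conveys the general idea; all data here is invented.

```python
# Greedy token-overlap alignment (toy stand-in for proposition-level alignment).
def overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta))

summary = ["Cavity barriers must limit fire spread."]
source = [
    "Cavity barriers limit the spread of fire within concealed spaces.",
    "Doors in escape routes must be self-closing.",
]

# Pair every summary sentence with its best-matching source sentence.
alignments = [(s, max(source, key=lambda src: overlap(s, src))) for s in summary]
print(alignments)
```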
- Sequence-to-sequence models in peer-to-peer learning: A practical application [0.0]
The paper explores the applicability of sequence-to-sequence (Seq2Seq) models based on LSTM units for the Automatic Speech Recognition (ASR) task within peer-to-peer learning environments.
The findings demonstrate the feasibility of employing Seq2Seq models in decentralized settings.
arXiv Detail & Related papers (2024-05-02T14:44:06Z)
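The paper's exact decentralized training protocol is not detailed in this summary; the hedged sketch below shows one common peer-to-peer ingredient, gossip averaging of model weights over a fixed topology, purely as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
peers = [rng.normal(size=8) for _ in range(4)]             # local model weights
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # ring topology

def gossip_round(params, topology):
    """Each peer averages its weights with those of its neighbors."""
    return [np.mean([params[i]] + [params[j] for j in topology[i]], axis=0)
            for i in range(len(params))]

for _ in range(10):
    peers = gossip_round(peers, neighbors)

# Repeated rounds drive the peers toward a consensus model.
print(np.std([p[0] for p in peers]))
```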
- Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation method, SUMMaug, for document classification.
We first obtain easy-to-learn examples for the target document classification task.
We then use the generated pseudo examples to perform curriculum learning.
arXiv Detail & Related papers (2023-12-01T11:34:37Z)
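A minimal sketch of the augmentation-plus-curriculum idea, assuming a placeholder summarize() in which truncation stands in for the real abstractive summarizer the method would use:

```python
# Hypothetical stand-in for an abstractive summarizer.
def summarize(text: str, max_words: int = 12) -> str:
    return " ".join(text.split()[:max_words])

corpus = [
    ("A long technical document about building codes and permits ...", "regulation"),
    ("A lengthy news report about the local elections last week ...", "news"),
]

# 1) Create easy-to-learn pseudo examples by summarizing each document.
pseudo = [(summarize(doc), label) for doc, label in corpus]

# 2) Curriculum learning: present short, easy examples before long ones.
curriculum = sorted(pseudo + corpus, key=lambda ex: len(ex[0].split()))
for text, label in curriculum:
    pass  # feed each (text, label) pair into the classifier's training loop
```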
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
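The sketch below approximates the document-ordering idea with exact TF-IDF nearest neighbors (the paper uses approximate algorithms at a much larger scale): each document is chained to its most similar unused neighbor so that pretraining sequences consist of related documents. The data and similarity choice are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = [
    "Building regulations define fire safety requirements.",
    "Fire safety codes apply to residential buildings.",
    "A recipe for sourdough bread.",
    "Baking bread with a sourdough starter.",
]

X = TfidfVectorizer().fit_transform(docs)
nn = NearestNeighbors(metric="cosine").fit(X)

order, seen = [0], {0}
while len(order) < len(docs):
    _, idxs = nn.kneighbors(X[order[-1]], n_neighbors=len(docs))
    nxt = next(i for i in idxs[0] if i not in seen)  # closest unused document
    order.append(nxt)
    seen.add(nxt)

print([docs[i] for i in order])  # related documents end up adjacent
```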
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [58.617025733655005]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
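A rough sketch of the open-word idea: sample extra nouns from WordNet and add them to the closed-set label words before building prompt texts. The class names, prompt template, and sampling strategy are assumptions, and the actual tuning of the prompts is not shown.

```python
import random
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

closed_set = ["cat", "dog", "airplane"]  # hypothetical closed-set labels

# Collect candidate open words (all WordNet noun lemmas).
nouns = {l.name().replace("_", " ") for s in wn.all_synsets("n") for l in s.lemmas()}
open_words = random.sample(sorted(nouns - set(closed_set)), k=20)

# Prompts now cover a simulated open-set label space.
prompts = [f"a photo of a {w}." for w in closed_set + open_words]
print(prompts[:5])
```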
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings [105.82772523968961]
INSTRUCTOR is a new method for computing text embeddings given task instructions.
Every text input is embedded together with instructions explaining the use case.
We evaluate INSTRUCTOR on 70 embedding evaluation tasks.
arXiv Detail & Related papers (2022-12-19T18:57:05Z)
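The sketch below shows only the input format, prefixing each text with a task instruction before embedding; it borrows an ordinary small encoder with mean pooling, not the actual instruction-finetuned INSTRUCTOR checkpoint, and the instruction strings are invented.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"  # stand-in encoder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(instruction: str, text: str) -> torch.Tensor:
    """Embed `text` prefixed with a task instruction, via mean pooling."""
    enc = tok(instruction + " " + text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (1, dim)

q = embed("Represent the question for retrieval:", "What is shallow parsing?")
d = embed("Represent the document for retrieval:", "Shallow parsing chunks text.")
print(torch.cosine_similarity(q, d))
```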
- Training Naturalized Semantic Parsers with Very Little Data [10.709587018625275]
State-of-the-art (SOTA) semantic parsers are seq2seq architectures based on large language models that have been pretrained on vast amounts of text.
Recent work has explored a reformulation of semantic parsing whereby the output sequences are themselves natural language sentences.
We show that this method delivers new SOTA few-shot performance on the Overnight dataset.
arXiv Detail & Related papers (2022-04-29T17:14:54Z)
- Multitasking Framework for Unsupervised Simple Definition Generation [5.2221935174520056]
We propose a novel task of Simple Definition Generation to help language learners and low literacy readers.
A significant challenge of this task is the lack of learner's dictionaries in many languages.
We propose a multitasking framework SimpDefiner that only requires a standard dictionary with complex definitions and a corpus containing arbitrary simple texts.
arXiv Detail & Related papers (2022-03-24T08:16:04Z)
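A minimal sketch of the multitasking idea, assuming a shared encoder with separate output heads for complex definitions and simple texts; SimpDefiner's real architecture and training objectives are more involved than this illustration.

```python
import torch
import torch.nn as nn

class MultiTaskDefiner(nn.Module):
    """Toy shared-encoder model with one head per task."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # shared parameters
        self.complex_head = nn.Linear(dim, vocab)  # definition-generation task
        self.simple_head = nn.Linear(dim, vocab)   # simple-text task

    def forward(self, ids, task):
        h, _ = self.encoder(self.embed(ids))
        head = self.simple_head if task == "simple" else self.complex_head
        return head(h)

model = MultiTaskDefiner()
logits = model(torch.randint(0, 1000, (2, 7)), task="simple")
print(logits.shape)  # torch.Size([2, 7, 1000])
```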
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
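For intuition, the sketch below implements "classify and count", the simplest baseline for learning to quantify: predict a label per document, then report the label proportions rather than the per-document predictions. The data is invented and LeQua's evaluation protocol is not reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_x = ["great product", "terrible service", "loved it", "awful quality"]
train_y = [1, 0, 1, 0]
test_x = ["great service", "terrible product", "loved the quality"]

vec = CountVectorizer().fit(train_x)
clf = LogisticRegression().fit(vec.transform(train_x), train_y)

# Quantification output: the estimated class prevalence on the test set.
prevalence = clf.predict(vec.transform(test_x)).mean()
print(f"estimated positive prevalence: {prevalence:.2f}")
```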
- Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences arising from its use.