A Distributed Automatic Domain-Specific Multi-Word Term Recognition
Architecture using Spark Ecosystem
- URL: http://arxiv.org/abs/2305.16343v1
- Date: Wed, 24 May 2023 10:05:59 GMT
- Title: A Distributed Automatic Domain-Specific Multi-Word Term Recognition
Architecture using Spark Ecosystem
- Authors: Ciprian-Octavian Truică and Neculai-Ovidiu Istrate and
Elena-Simona Apostol
- Abstract summary: We propose a distributed Spark-based architecture to automatically extract domain-specific terms.
We empirically demonstrate the feasibility of our architecture through experiments on two real-world datasets.
- Score: 0.5156484100374059
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Term Recognition is used to extract domain-specific terms that
belong to a given domain. To be accurate, these corpus- and
language-dependent methods require large volumes of textual data that need to
be processed to extract candidate terms that are afterward scored according to
a given metric. To improve text preprocessing and candidate term extraction
and scoring, we propose a distributed Spark-based architecture to automatically
extract domain-specific terms. The main contributions are as follows: (1)
propose a novel distributed automatic domain-specific multi-word term
recognition architecture built on top of the Spark ecosystem; (2) perform an
in-depth analysis of our architecture in terms of accuracy and scalability; (3)
design an easy-to-integrate Python implementation that enables the use of Big
Data processing in fields such as Computational Linguistics and Natural
Language Processing. We empirically demonstrate the feasibility of our
architecture through experiments on two real-world datasets.
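The pipeline the abstract describes (preprocess text, extract candidate multi-word terms, score them with a metric) can be illustrated with a minimal single-machine sketch in plain Python; in the proposed architecture each of these stages would run as a distributed Spark transformation instead. The n-gram window, stopword list, and raw-frequency metric below are illustrative assumptions, not the paper's actual scoring method.

```python
from collections import Counter

def candidate_ngrams(tokens, n_min=2, n_max=3,
                     stopwords=frozenset({"the", "of", "a", "in"})):
    """Extract contiguous n-grams that do not start or end with a stopword."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            out.append(" ".join(gram))
    return out

def score_terms(docs):
    """Score candidates by raw corpus frequency (a stand-in for a real
    termhood metric such as C-value)."""
    counts = Counter()
    for doc in docs:
        counts.update(candidate_ngrams(doc.lower().split()))
    return counts

docs = [
    "automatic term recognition extracts domain specific terms",
    "domain specific terms require large corpora",
]
ranked = score_terms(docs).most_common(3)
```

In a Spark version, the per-document extraction would be a `flatMap` over a distributed collection of documents and the counting a `reduceByKey`, which is what makes the approach scale to large corpora.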
Related papers
- Deep Learning and Natural Language Processing in the Field of Construction [0.09208007322096533]
We first describe the corpus analysis method to extract terminology from a collection of technical specifications in the field of construction.
We then perform pruning steps with linguistic patterns and internet queries to improve the quality of the final terminology.
Second, we present a machine-learning approach based on various word embedding models and their combinations to detect hypernyms in the extracted terminology.
arXiv Detail & Related papers (2025-01-14T07:53:44Z)
- Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language [4.5224851085910585]
Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages.
This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in low-resource domain-specific German.
arXiv Detail & Related papers (2024-12-13T09:47:26Z)
- DORIC: Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing [14.709084509818474]
DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction.
We leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs.
Our approach achieved 3rd place in the precision score and showed higher accuracy and normalized mutual information (NMI) scores than the baseline model.
arXiv Detail & Related papers (2023-03-17T08:12:36Z)
- Extracting Domain-specific Concepts from Large-scale Linked Open Data [0.0]
The proposed method defines search entities by linking the LOD vocabulary with terms related to the target domain.
The occurrences of common upper-level entities and the chain-of-path relationships are examined to determine the range of conceptual connections in the target domain.
arXiv Detail & Related papers (2021-11-22T10:25:57Z)
- Seed Words Based Data Selection for Language Model Adaptation [11.59717828860318]
We present an approach for automatically selecting sentences from a text corpus that match, both semantically and morphologically, a glossary of terms provided by the user.
The vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate.
Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
arXiv Detail & Related papers (2021-07-20T12:08:27Z)
- Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug in queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
- Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Batch Normalization Embeddings for Deep Domain Generalization [50.51405390150066]
Domain generalization aims at training machine learning models to perform robustly across different and unseen domains.
We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks.
arXiv Detail & Related papers (2020-11-25T12:02:57Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) remains an active research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to achieve sequence-level domain adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.