A Distributed Automatic Domain-Specific Multi-Word Term Recognition
Architecture using Spark Ecosystem
- URL: http://arxiv.org/abs/2305.16343v1
- Date: Wed, 24 May 2023 10:05:59 GMT
- Title: A Distributed Automatic Domain-Specific Multi-Word Term Recognition
Architecture using Spark Ecosystem
- Authors: Ciprian-Octavian Truică and Neculai-Ovidiu Istrate and
Elena-Simona Apostol
- Abstract summary: We propose a distributed Spark-based architecture to automatically extract domain-specific terms.
We empirically demonstrate the feasibility of our architecture through experiments on two real-world datasets.
- Score: 0.5156484100374059
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Term Recognition is used to extract domain-specific terms that
belong to a given domain. To be accurate, these corpus- and
language-dependent methods require large volumes of textual data that need to
be processed to extract candidate terms that are afterward scored according to
a given metric. To improve text preprocessing and candidate term extraction
and scoring, we propose a distributed Spark-based architecture to automatically
extract domain-specific terms. The main contributions are as follows: (1)
propose a novel distributed automatic domain-specific multi-word term
recognition architecture built on top of the Spark ecosystem; (2) perform an
in-depth analysis of our architecture in terms of accuracy and scalability; (3)
design an easy-to-integrate Python implementation that enables the use of Big
Data processing in fields such as Computational Linguistics and Natural
Language Processing. We empirically demonstrate the feasibility of our
architecture through experiments on two real-world datasets.
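The pipeline the abstract describes (preprocess text, extract candidate multi-word terms, score them with a metric) can be illustrated with a minimal single-machine sketch in plain Python; in the proposed architecture each of these stages would run as a distributed Spark transformation instead. The n-gram window, stopword list, and raw-frequency metric below are illustrative assumptions, not the paper's actual scoring method.

```python
from collections import Counter

def candidate_ngrams(tokens, n_min=2, n_max=3,
                     stopwords=frozenset({"the", "of", "a", "in"})):
    """Extract contiguous n-grams that do not start or end with a stopword."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            out.append(" ".join(gram))
    return out

def score_terms(docs):
    """Score candidates by raw corpus frequency (a stand-in for a real
    termhood metric such as C-value)."""
    counts = Counter()
    for doc in docs:
        counts.update(candidate_ngrams(doc.lower().split()))
    return counts

docs = [
    "automatic term recognition extracts domain specific terms",
    "domain specific terms require large corpora",
]
ranked = score_terms(docs).most_common(3)
```

In a Spark version, the per-document extraction would be a `flatMap` over a distributed collection of documents and the counting a `reduceByKey`, which is what makes the approach scale to large corpora.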
Related papers
- Deep Learning and Natural Language Processing in the Field of Construction [0.09208007322096533]
We first describe the corpus analysis method to extract terminology from a collection of technical specifications in the field of construction.
We then perform pruning steps with linguistic patterns and internet queries to improve the quality of the final terminology.
Second, we present a machine-learning approach based on various word embedding models and their combinations to detect hypernyms in the extracted terminology.
arXiv Detail & Related papers (2025-01-14T07:53:44Z)
- Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language [4.5224851085910585]
Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages.
This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in low-resource domain-specific German.
arXiv Detail & Related papers (2024-12-13T09:47:26Z)
- DORIC: Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing [14.709084509818474]
DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction.
We leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs.
Our approach achieved 3rd place in the precision score and showed higher accuracy and normalized mutual information (NMI) scores than the baseline model.
arXiv Detail & Related papers (2023-03-17T08:12:36Z)
- Extracting Domain-specific Concepts from Large-scale Linked Open Data [0.0]
The proposed method defines search entities by linking the LOD vocabulary with terms related to the target domain.
The occurrences of common upper-level entities and the chain-of-path relationships are examined to determine the range of conceptual connections in the target domain.
arXiv Detail & Related papers (2021-11-22T10:25:57Z)
- Seed Words Based Data Selection for Language Model Adaptation [11.59717828860318]
We present an approach for automatically selecting sentences from a text corpus that match, both semantically and morphologically, a glossary of terms provided by the user.
The vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate.
Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
arXiv Detail & Related papers (2021-07-20T12:08:27Z)
- Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug in queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
- Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Batch Normalization Embeddings for Deep Domain Generalization [50.51405390150066]
Domain generalization aims at training machine learning models to perform robustly across different and unseen domains.
We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks.
arXiv Detail & Related papers (2020-11-25T12:02:57Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) remains an active research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to achieve sequence-level domain adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.