An Exploratory Study on Utilising the Web of Linked Data for Product
Data Mining
- URL: http://arxiv.org/abs/2109.01411v1
- Date: Fri, 3 Sep 2021 09:58:36 GMT
- Title: An Exploratory Study on Utilising the Web of Linked Data for Product
Data Mining
- Authors: Ziqi Zhang, Xingyi Song
- Abstract summary: This work focuses on the e-commerce domain to explore methods of utilising structured data to create language resources that may be used for product classification and linking.
We process billions of structured data points in the form of RDF n-quads to create multi-million-word, product-related corpora that are later used in three different ways to create language resources.
Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks.
- Score: 3.7376948366228175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Linked Open Data practice has led to a significant growth of structured
data on the Web in the last decade. Such structured data describe real-world
entities in a machine-readable way, and have created an unprecedented
opportunity for research in the field of Natural Language Processing. However,
there is a lack of studies on how such data can be used, for what kind of
tasks, and to what extent they can be useful for these tasks. This work focuses
on the e-commerce domain to explore methods of utilising such structured data
to create language resources that may be used for product classification and
linking. We process billions of structured data points in the form of RDF
n-quads to create multi-million-word, product-related corpora that are later
used in three different ways to create language resources: training
word embedding models, continued pre-training of BERT-like language models, and
training Machine Translation models that are used as a proxy to generate
product-related keywords. Our evaluation on an extensive set of benchmarks
shows word embeddings to be the most reliable and consistent method to improve
the accuracy on both tasks (with up to 6.9 percentage points in macro-average
F1 on some datasets). The other two methods, however, are not as useful. Our
analysis shows that this could be due to a number of reasons, including the
biased domain representation in the structured data and lack of vocabulary
coverage. We share our datasets and discuss how our lessons learned could be
taken forward to inform future research in this direction.
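To make the corpus-building idea concrete, below is a minimal Python sketch of the embedding-based approach that the abstract reports as most reliable: scanning an RDF n-quads dump for product-related literals and training a word2vec model on the extracted text. The file name, the schema.org predicate filters, and the gensim hyperparameters are illustrative assumptions, not the authors' exact pipeline.

# Minimal sketch: extract product-related literals from an RDF n-quads dump
# and train word embeddings on them.
# Assumptions (not from the paper): a local file "product_quads.nq", the gensim
# library, and simple schema.org predicate filters for names/descriptions.
import re
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

PRODUCT_PREDICATES = (
    "<http://schema.org/name>",
    "<http://schema.org/description>",
)

def extract_literals(nq_path):
    """Yield tokenised text literals whose predicate looks product-related."""
    literal_re = re.compile(r'"(.*?)"')  # object literal between double quotes
    with open(nq_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            parts = line.split(" ", 2)   # subject, predicate, rest of the quad
            if len(parts) < 3 or parts[1] not in PRODUCT_PREDICATES:
                continue
            match = literal_re.search(parts[2])
            if match:
                tokens = simple_preprocess(match.group(1))
                if tokens:
                    yield tokens

# Build the corpus in memory and train a skip-gram word2vec model on it.
sentences = list(extract_literals("product_quads.nq"))
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=4)
model.wv.save("product_embeddings.kv")

At the scale of billions of quads described in the abstract, the in-memory list would be replaced by a streaming corpus iterator, but the two stages (literal extraction, then embedding training) stay the same.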
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis hold significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z) - Query of CC: Unearthing Large Scale Domain-Specific Knowledge from
Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - Faithful Low-Resource Data-to-Text Generation through Cycle Training [14.375070014155817]
Methods to generate text from structured data have advanced significantly in recent years.
Cycle training uses two models which are inverses of each other.
We show that cycle training achieves nearly the same performance as fully supervised approaches.
arXiv Detail & Related papers (2023-05-24T06:44:42Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - What's in a Name? Evaluating Assembly-Part Semantic Knowledge in
Language Models through User-Provided Names in CAD Files [4.387757291346397]
We propose that the natural language names designers use in Computer Aided Design (CAD) software are a valuable source of such knowledge.
In particular, we extract and clean a large corpus of natural language part, feature and document names.
We show that fine-tuning on the text data corpus further boosts the performance on all tasks, thus demonstrating the value of the text data.
arXiv Detail & Related papers (2023-04-25T12:30:01Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Annotated Dataset Creation through General Purpose Language Models for
non-English Medical NLP [0.5482532589225552]
In our work, we propose leveraging pretrained language models for training data acquisition.
We create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED.
arXiv Detail & Related papers (2022-08-30T18:42:55Z) - What Makes Data-to-Text Generation Hard for Pretrained Language Models? [17.07349898176898]
Expressing natural language descriptions of structured facts or relations -- data-to-text generation (D2T) -- increases the accessibility of structured knowledge repositories.
Previous work shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data.
We conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset.
arXiv Detail & Related papers (2022-05-23T17:58:39Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.