What's in a Name? Evaluating Assembly-Part Semantic Knowledge in
Language Models through User-Provided Names in CAD Files
- URL: http://arxiv.org/abs/2304.14275v1
- Date: Tue, 25 Apr 2023 12:30:01 GMT
- Authors: Peter Meltzer, Joseph G. Lambourne, Daniele Grandi
- Abstract summary: We propose that the natural language names designers use in Computer Aided Design (CAD) software are a valuable source of such knowledge.
In particular we extract and clean a large corpus of natural language part, feature and document names.
We show that fine-tuning on the text data corpus further boosts the performance on all tasks, thus demonstrating the value of the text data.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Semantic knowledge of part-part and part-whole relationships in assemblies is
useful for a variety of tasks from searching design repositories to the
construction of engineering knowledge bases. In this work we propose that the
natural language names designers use in Computer Aided Design (CAD) software
are a valuable source of such knowledge, and that Large Language Models (LLMs)
contain useful domain-specific information for working with this data as well
as other CAD and engineering-related tasks.
In particular we extract and clean a large corpus of natural language part,
feature and document names and use this to quantitatively demonstrate that a
pre-trained language model can outperform numerous benchmarks on three
self-supervised tasks, without ever having seen this data before. Moreover, we
show that fine-tuning on the text data corpus further boosts the performance on
all tasks, thus demonstrating the value of the text data which until now has
been largely ignored. We also identify key limitations to using LLMs with text
data alone, and our findings provide a strong motivation for further work into
multi-modal text-geometry models.
To aid and encourage further work in this area we make all our data and code
publicly available.
Related papers
- DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Unified Text Structuralization with Instruction-tuned Language Models [28.869098023025753]
We propose a simple and efficient approach to instruct large language models (LLMs) to extract a variety of structures from texts.
Experiments show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets covering a variety of languages and knowledge domains.
arXiv Detail & Related papers (2023-03-27T07:39:05Z)
- Knowledge Based Multilingual Language Model [44.70205282863062]
We present a novel framework to pretrain knowledge-based multilingual language models (KMLMs).
We generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs.
Based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning.
arXiv Detail & Related papers (2021-11-22T02:56:04Z)
- An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining [3.7376948366228175]
This work focuses on the e-commerce domain to explore methods of utilising structured data to create language resources that may be used for product classification and linking.
We process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora that are later used in three different ways to create language resources.
Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks.
arXiv Detail & Related papers (2021-09-03T09:58:36Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of the rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Quda: Natural Language Queries for Visual Data Analytics [33.983060903399554]
We present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language.
Our dataset contains 14,035 diverse user queries, and each is annotated with one or multiple analytic tasks.
This work is the first attempt to construct a large-scale corpus for recognizing analytic tasks.
arXiv Detail & Related papers (2020-05-07T05:35:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.