Novel Preprocessing Technique for Data Embedding in Engineering Code
Generation Using Large Language Model
- URL: http://arxiv.org/abs/2311.16267v2
- Date: Tue, 30 Jan 2024 08:00:07 GMT
- Title: Novel Preprocessing Technique for Data Embedding in Engineering Code
Generation Using Large Language Model
- Authors: Yu-Chen Lin, Akhilesh Kumar, Norman Chang, Wenliang Zhang, Muhammad
Zakir, Rucha Apte, Haiyang He, Chao Wang, Jyh-Shing Roger Jang
- Abstract summary: We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code.
We introduce the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data reliability.
We also develop the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique.
- Score: 7.74830226656449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present four main contributions to enhance the performance of Large
Language Models (LLMs) in generating domain-specific code: (i) utilizing
LLM-based data splitting and data renovation techniques to improve the semantic
representation of embeddings' space; (ii) introducing the Chain of Density for
Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text
Renovation (ATR) algorithm for assessing data renovation reliability; (iii)
developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt
technique; and (iv) effectively refactoring existing scripts to generate new
and high-quality scripts with LLMs. By using engineering simulation software
RedHawk-SC as a case study, we demonstrate the effectiveness of our data
pre-processing method for expanding and categorizing scripts. When combined
with IKEC, these techniques enhance the Retrieval-Augmented Generation (RAG)
method in retrieving more relevant information, ultimately achieving a 73.33%
"Percentage of Correct Lines" for code generation problems in MapReduce
applications.
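A minimal sketch of the retrieval flow the abstract describes, under stated assumptions: existing scripts are split into chunks, each chunk is embedded, the chunks most similar to a task description are retrieved, and an IKEC-style instruction asks the model to expand on its implicit knowledge of that context before writing code. The hashed bag-of-words embedding, the prompt wording, and all function names below are illustrative stand-ins, not the paper's CoDRC, ATR, or IKEC implementations.

```python
# Illustrative sketch only: the embedding is a toy hashed bag-of-words model
# standing in for a learned embedding space, and the prompt wording is an
# assumed IKEC-style instruction, not the paper's exact technique.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding (placeholder for a real model)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def split_scripts(scripts: list[str]) -> list[str]:
    """Naive data splitting: one chunk per blank-line-separated block."""
    return [p.strip() for s in scripts for p in s.split("\n\n") if p.strip()]

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(task: str, context: list[str]) -> str:
    """Assemble an IKEC-flavored prompt: ask the model to expand on the
    implicit knowledge in the retrieved context before generating code."""
    return (
        "You are generating RedHawk-SC MapReduce code.\n"
        "First expand on the implicit knowledge contained in the context "
        "below, then write the requested script.\n\n"
        "Context:\n" + "\n\n".join(context) + f"\n\nTask: {task}\n"
    )

# Toy usage with two stand-in 'existing scripts'.
scripts = [
    "def run_power_map(design):\n    return design.create_map('power')",
    "def extract_ir_drop(view):\n    return view.get_result('ir_drop')",
]
chunks = split_scripts(scripts)
task = "Generate a script that maps IR drop per instance."
print(build_prompt(task, retrieve("IR drop map per instance", chunks)))
```

In the paper's pipeline, a learned embedding space refined by LLM-based data renovation (with CoDRC and ATR reliability scoring) would replace the toy `embed` and `split_scripts` steps before retrieval.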
Related papers
- Training LLMs for Generating IEC 61131-3 Structured Text with Online Feedback [0.0]
This paper proposes a novel approach to training large language models (LLMs) that emphasizes improving the quality of learning data.
The framework proves highly suitable for industrial automation applications and outperforms state-of-the-art models.
arXiv Detail & Related papers (2024-10-29T15:54:09Z)
- On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation [16.79923685316516]
We explore three prompt construction protocols: Expert-guided, LLM-guided, and Novel-Mapping.
We find that context-enriched prompts lead to significantly improved data generation quality and training efficiency.
arXiv Detail & Related papers (2024-09-06T00:02:09Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of the generated code (an illustrative clustering-based pruning sketch follows this entry).
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
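The pruning entry above names clustering and pruning metrics without further detail, so the following is a minimal sketch of one common realization, assumed here for illustration: represent the training samples with TF-IDF features, cluster them with k-means, and keep the sample nearest to each centroid. The feature choice and the one-representative-per-cluster rule are assumptions, not the paper's metrics.

```python
# Illustrative clustering-based pruning of a code-training corpus: keep only
# the sample nearest to each k-means centroid. TF-IDF features and the
# one-per-cluster rule are assumptions, not the paper's exact metrics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def prune_by_clustering(samples: list[str], n_clusters: int = 10) -> list[str]:
    """Reduce a list of training samples to one representative per cluster."""
    k = min(n_clusters, len(samples))
    features = TfidfVectorizer().fit_transform(samples)   # sparse TF-IDF matrix
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    kept = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        # choose the member closest to the cluster centroid
        dists = np.linalg.norm(
            features[members].toarray() - km.cluster_centers_[c], axis=1
        )
        kept.append(samples[members[int(np.argmin(dists))]])
    return kept

# Toy usage: prune ten near-duplicate snippets down to three representatives.
corpus = [f"def add_{i}(a, b):\n    return a + b  # variant {i}" for i in range(10)]
print(len(prune_by_clustering(corpus, n_clusters=3)))   # -> 3
```

In practice the number of clusters (equivalently, the pruning ratio) would be tuned against downstream code-generation accuracy rather than fixed.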
- Improving Retrieval for RAG based Question Answering Models on Financial Documents [0.046603287532620746]
This paper explores the existing constraints of RAG pipelines and introduces methodologies for enhancing text retrieval.
It covers strategies such as sophisticated chunking techniques, query expansion, the incorporation of metadata annotations, re-ranking algorithms, and the fine-tuning of embedding algorithms (an illustrative chunking and re-ranking sketch follows this entry).
arXiv Detail & Related papers (2024-03-23T00:49:40Z)
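The retrieval entry above lists several enhancement strategies; the sketch below illustrates two of them, overlapping chunking and metadata-filtered re-ranking, in deliberately simplified form. The chunk size, the metadata field, and the term-overlap scoring are assumptions for illustration, not the paper's configuration.

```python
# Simplified illustration of two retrieval enhancements: overlapping chunking
# (so sentences cut at a boundary survive in a neighboring chunk) and
# metadata-filtered re-ranking by term overlap. All settings are assumptions.
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split a document into fixed-size character chunks that overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def rerank(query: str, chunks: list[dict], section=None, k: int = 3) -> list[dict]:
    """Keep chunks whose metadata matches, then re-rank by query-term overlap."""
    candidates = [c for c in chunks if section is None or c["section"] == section]
    q_terms = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda c: len(q_terms & set(c["text"].lower().split())),
        reverse=True,
    )[:k]

# Toy usage: annotate each chunk with the report section it came from.
doc = ("Revenue grew 12% year over year. "
       "Operating margin declined slightly due to higher input costs. ") * 5
chunks = [{"text": t, "section": "financials"} for t in chunk_with_overlap(doc)]
best = rerank("How did revenue change year over year", chunks, section="financials")
print(best[0]["text"][:60])
```

Query expansion and embedding fine-tuning, the other strategies named, would sit before and beneath this step: rewriting the query prior to retrieval and improving the vectors used for the first-stage search.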
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance (a minimal zeroth-order gradient sketch follows this entry).
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
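The benchmark above spans many ZO variants; for orientation, here is a minimal sketch of the two-point randomized finite-difference gradient estimator that ZO-SGD-style methods build on, using forward loss evaluations only. It is an illustrative baseline, not the benchmark's code.

```python
# Minimal sketch of a two-point zeroth-order gradient estimate (SPSA-style):
# the gradient is approximated from forward loss evaluations only, with no
# backpropagation. Illustrative only; not the benchmark's implementation.
import numpy as np

def zo_gradient(loss_fn, params: np.ndarray, mu: float = 1e-3,
                n_samples: int = 16) -> np.ndarray:
    """Estimate grad(loss) at `params` from random finite differences."""
    grad = np.zeros_like(params)
    for _ in range(n_samples):
        u = np.random.randn(*params.shape)          # random direction
        delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / n_samples

def zo_sgd(loss_fn, params: np.ndarray, lr: float = 0.1, steps: int = 200):
    """Plain ZO-SGD loop: descend along the estimated gradient."""
    for _ in range(steps):
        params = params - lr * zo_gradient(loss_fn, params)
    return params

# Toy usage: minimize a quadratic without ever calling a backward pass.
target = np.array([1.0, -2.0, 0.5])
solution = zo_sgd(lambda w: float(np.sum((w - target) ** 2)), np.zeros(3))
print(np.round(solution, 2))   # approaches [1.0, -2.0, 0.5]
```

Because only forward evaluations are needed, no intermediate activations are stored for backpropagation, which is the source of the memory savings discussed above.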
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the system's code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves new state-of-the-art results (a minimal contrastive-objective sketch follows this entry).
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
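MultiCSR itself is only summarized above; as background, the sketch below shows the core contrastive (InfoNCE-style) objective that contrastive sentence representation learning optimizes, pulling each embedding toward its positive pair and away from in-batch negatives. The toy embeddings and the temperature are illustrative assumptions, not MultiCSR's multi-level framework.

```python
# Minimal sketch of the contrastive (InfoNCE-style) objective behind sentence
# representation learning: each embedding should be close to its positive pair
# and far from the other sentences in the batch. Illustrative only; this is
# not MultiCSR's multi-level framework.
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """anchors, positives: (batch, dim) L2-normalized sentence embeddings;
    row i of `positives` is the positive pair of row i of `anchors`."""
    sims = anchors @ positives.T / temperature            # (batch, batch)
    # log-softmax over each row: the diagonal is the positive, the rest negatives
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy usage with random vectors standing in for LLM-generated sentence pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
pos = emb + 0.01 * rng.normal(size=emb.shape)             # lightly augmented pair
pos /= np.linalg.norm(pos, axis=1, keepdims=True)
print(round(info_nce_loss(emb, pos), 3))                  # near zero: pairs align
```

Per its title, MultiCSR layers its multi-level corpus-generation and refinement process on top of an objective of this general form.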
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.