Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
- URL: http://arxiv.org/abs/2401.14624v4
- Date: Sat, 24 May 2025 06:22:30 GMT
- Title: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
- Authors: Zhaoye Fei, Yunfan Shao, Linyang Li, Zhiyuan Zeng, Conghui He, Qipeng Guo, Hang Yan, Dahua Lin, Xipeng Qiu
- Abstract summary: We introduce large models into the data collection pipeline to guide the generation of domain-specific information. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus.
- Score: 103.0865116794534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable potential in various tasks; however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl (CC), a large public corpus. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Retrieve-Pile, which covers four main domains, including the sciences, humanities, and other categories. Our analysis shows that Retrieve-from-CC can effectively retrieve relevant data from the covered knowledge domains and significantly improve performance in tests of mathematical and knowledge-related reasoning abilities. We have released Retrieve-Pile at https://huggingface.co/datasets/Query-of-CC/Retrieve-Pile.
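The abstract's two-stage idea, have a large model produce domain-specific queries and then retrieve matching documents from a public corpus, can be sketched with a toy BM25 ranker. This is a minimal illustration, not the paper's actual implementation; the queries (which the paper would generate with an LLM), the corpus, and the tokenization are all hypothetical.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with (simplified) BM25."""
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

def retrieve(queries, corpus, top_k=2):
    """Rank corpus documents against each domain query; drop zero-score hits."""
    tokenized = [doc.lower().split() for doc in corpus]
    ranked = {}
    for q in queries:
        q_terms = q.lower().split()
        scores = [(bm25_score(q_terms, d, tokenized), i) for i, d in enumerate(tokenized)]
        ranked[q] = [corpus[i] for s, i in sorted(scores, reverse=True)[:top_k] if s > 0]
    return ranked

# Hypothetical: in the paper these queries would come from an LLM seeded with a domain.
queries = ["prove the triangle inequality", "photosynthesis light reactions"]
corpus = [
    "The triangle inequality states that for any sides of a triangle the sum holds.",
    "Light reactions of photosynthesis occur in the thylakoid membrane.",
    "Stock prices fell sharply on Tuesday.",
]
hits = retrieve(queries, corpus)
```

At Common Crawl scale the scoring would of course run over an inverted index rather than a Python loop; the sketch only shows the query-then-rank shape of the pipeline.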
Related papers
- RedStone: Curating General, Code, Math, and QA Data for Large Language Models [134.49774529790693]
This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training Large Language Models.
We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl.
arXiv Detail & Related papers (2024-12-04T15:27:39Z) - A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning [0.0]
This research proposes a pipeline to construct high-quality instruction datasets for fine-tuning on specific domains.
By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions.
As a case study, we apply this approach to the domain of psychiatry, a field requiring specialized knowledge and sensitive handling of patient information.
arXiv Detail & Related papers (2024-08-12T03:52:11Z) - Adapting to Distribution Shift by Visual Domain Prompt Generation [34.19066857066073]
We adapt a model at test time using a small amount of unlabeled data to address distribution shifts.
We build a knowledge bank to learn the transferable knowledge from source domains.
The proposed method outperforms previous work on 5 large-scale benchmarks including WILDS and DomainNet.
arXiv Detail & Related papers (2024-05-05T02:44:04Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z) - Clue-Guided Path Exploration: Optimizing Knowledge Graph Retrieval with Large Language Models to Address the Information Black Box Challenge [19.40489486138002]
We propose a Clue-Guided Path Exploration (CGPE) framework to optimize knowledge retrieval based on large language models.
Experiments on open-source datasets reveal that CGPE outperforms previous methods and is highly applicable to LLMs with fewer parameters.
arXiv Detail & Related papers (2024-01-24T13:36:50Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
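The capture-the-flag principle reduces to a recall-style metric: plant known flags in a dataset and count how many of the model's reported insights mention them. A minimal sketch, where the naive substring matching and the example data are hypothetical simplifications of whatever matching the paper actually uses:

```python
def ctf_score(model_findings, flags):
    """Fraction of planted flags captured by the model's reported insights.

    A flag counts as captured if any reported insight mentions it
    (case-insensitive substring match; a hypothetical simplification).
    """
    captured = sum(
        any(flag.lower() in finding.lower() for finding in model_findings)
        for flag in flags
    )
    return captured / len(flags)

# Hypothetical example: two flags planted, the model reports two insights.
flags = ["revenue dropped in Q3", "region X has missing values"]
findings = [
    "Sales revenue dropped in Q3 compared to Q2.",
    "Customer churn is highest among new users.",
]
score = ctf_score(findings, flags)  # captures 1 of 2 flags -> 0.5
```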
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
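The three steps can be illustrated with a toy sketch. The fact store, the keyword-based selection heuristic, and the prompt template below are hypothetical stand-ins, not the paper's actual components:

```python
def prepare_knowledge():
    """Step 1: prepare effective knowledge for the task (here, a toy fact store)."""
    return {
        "aspirin": "Aspirin is a nonsteroidal anti-inflammatory drug (NSAID).",
        "ibuprofen": "Ibuprofen is an NSAID used to treat pain and fever.",
    }

def select_knowledge(sample, store):
    """Step 2: select the knowledge relevant to this specific sample."""
    return [fact for key, fact in store.items() if key in sample.lower()]

def express_knowledge(sample, facts):
    """Step 3: express the selected knowledge in an LLM-understandable way."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Background knowledge:\n{context}\n\nQuestion: {sample}"

store = prepare_knowledge()
sample = "Can I take aspirin with food?"
prompt = express_knowledge(sample, select_knowledge(sample, store))
```

The point of the paradigm is that only the third artifact, the expressed prompt, ever reaches the LLM; the extractor itself can be swapped without retraining the model.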
arXiv Detail & Related papers (2023-11-16T07:09:38Z) - Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z) - Bridged-GNN: Knowledge Bridge Learning for Effective Knowledge Transfer [65.42096702428347]
Graph Neural Networks (GNNs) aggregate information from neighboring nodes.
Knowledge Bridge Learning (KBL) learns a knowledge-enhanced posterior distribution for target domains.
Bridged-GNN includes an Adaptive Knowledge Retrieval module to build Bridged-Graph and a Graph Knowledge Transfer module.
arXiv Detail & Related papers (2023-08-18T12:14:51Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z) - HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression [53.90578309960526]
Large pre-trained language models (PLMs) have shown overwhelming performances compared with traditional neural network methods.
We propose a hierarchical relational knowledge distillation (HRKD) method to capture both hierarchical and domain relational information.
arXiv Detail & Related papers (2021-10-16T11:23:02Z) - Towards Data-Free Domain Generalization [12.269045654957765]
How can knowledge contained in models trained on different source data domains be merged into a single model that generalizes well to unseen target domains?
Prior domain generalization methods typically rely on using source domain data, making them unsuitable for private decentralized data.
We propose DEKAN, an approach that extracts and fuses domain-specific knowledge from the available teacher models into a student model robust to domain shift.
arXiv Detail & Related papers (2021-10-09T11:44:05Z) - Data Augmentation for Cross-Domain Named Entity Recognition [22.66649873447105]
We study cross-domain data augmentation for the named entity recognition task.
We propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain.
We show that transforming the data to the low-resource domain representation achieves significant improvements over only using data from high-resource domains.
arXiv Detail & Related papers (2021-09-04T00:50:55Z) - Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z) - Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog [70.79442700890843]
We propose a novel Dynamic Fusion Network (DF-Net) which automatically exploits the relevance between the target domain and each domain.
With little training data, we show its transferability by outperforming the prior best model by 13.9% on average.
arXiv Detail & Related papers (2020-04-23T08:17:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.