Related papers: DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

URL: http://arxiv.org/abs/2405.18315v1
Date: Tue, 28 May 2024 16:07:45 GMT
Title: DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data
Authors: Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He
Abstract summary: In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly. This article introduces a framework that aims to simplify dataset processing by providing a unified standard for AI datasets. The standardized specifications of DSDL reduce the workload for users in data dissemination, processing, and usage.
Score: 50.88106211204689
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly, requiring understanding and format conversion before it can be used by researchers or developers with different needs. To tackle this problem, this article introduces a framework called Dataset Description Language (DSDL) that aims to simplify dataset processing by providing a unified standard for AI datasets. DSDL adheres to the three basic practical principles of generic, portable, and extensible, using a unified standard to express data of different modalities and structures, facilitating the dissemination of AI data, and easily extending to new modalities and tasks. The standardized specifications of DSDL reduce the workload for users in data dissemination, processing, and usage. To further improve user convenience, we provide predefined DSDL templates for various tasks, convert mainstream datasets to comply with DSDL specifications, and provide comprehensive documentation and DSDL tools. These efforts aim to simplify the use of AI data, thereby improving the efficiency of AI development.

Related papers

TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes [25.05627023905607]
We envision a new multi-modal data analytics system based on the Model Context Protocol (MCP). First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities.
arXiv Detail & Related papers (2025-05-16T14:03:30Z)
A New Paradigm of User-Centric Wireless Communication Driven by Large Language Models [53.16213723669751]
The next generation of wireless communications seeks to deeply integrate artificial intelligence with user-centric communication networks. We propose a novel paradigm for wireless communication that incorporates natural language to structured query language translation. We present a prototype system in which a dynamic semantic representation network at the physical layer adapts its encoding depth to meet user requirements.
arXiv Detail & Related papers (2025-04-16T01:43:36Z)
Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models [22.16558378953053]
We build state-of-the-art instruction-tuning datasets sourced from human-written instructions. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Analyses suggest that instruction-tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language.
arXiv Detail & Related papers (2025-03-31T04:28:38Z)
A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for General Industrial Process Tasks Based on Large Language Model [16.842988666530204]
Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. Development requires complex and costly customized designs tailored to various tasks during the modeling process. We propose a general framework named LLM-TKESS (large language model for text-based knowledge-embedded soft sensing) for enhanced soft sensing modeling.
arXiv Detail & Related papers (2025-01-09T08:59:14Z)
LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements [50.544186914115045]
This paper presents TEDUO, a novel training pipeline for offline language-conditioned policy learning. TEDUO operates on easy-to-obtain, unlabeled datasets and is suited for the so-called in-the-wild evaluation, wherein the agent encounters previously unseen goals and states.
arXiv Detail & Related papers (2024-12-09T18:43:56Z)
DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models [38.59653405736706]
We introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE). We show that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percent in certain cases.
arXiv Detail & Related papers (2024-11-05T16:47:53Z)
Federated Data-Efficient Instruction Tuning for Large Language Models [34.35613476734293]
Federated data-efficient instruction tuning for large language models, FedHDS, is presented. It reduces the redundancy of data samples at both intra-client and inter-client levels. Experiments show that FedHDS significantly reduces the amount of data required for fine-tuning while improving the responsiveness of the instruction-tuned LLMs to unseen tasks.
arXiv Detail & Related papers (2024-10-14T15:05:51Z)
Iterative Data Generation with Large Language Models for Aspect-based Sentiment Analysis [39.57537769578304]
We propose a systematic Iterative Data Generation framework, namely IDG, to boost the performance of ABSA. The core of IDG is to make full use of the powerful abilities (i.e., instruction-following, in-context learning and self-reflection) of LLMs to iteratively generate more fluent and diverse pseudo-label data. IDG brings consistent and significant performance gains among five baseline ABSA models.
arXiv Detail & Related papers (2024-06-29T07:00:37Z)
OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing. OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services. We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation. On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing [13.901862478287509]
We propose a new design pattern in which large language models (LLMs) work as a generic data operator (LLM-GDO). In the LLM-GDO design pattern, user-defined prompts (UDPs) are used to represent the data processing logic rather than implementations with a specific programming language. Fine-tuning LLMs with domain-specific data could enhance performance on domain-specific tasks, which makes data processing knowledge-aware.
arXiv Detail & Related papers (2023-12-26T23:08:38Z)
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances. We design fine-grained step-by-step instructions to obtain the initial data instances. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.