Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
for Generative AI
- URL: http://arxiv.org/abs/2401.14019v1
- Date: Thu, 25 Jan 2024 08:57:33 GMT
- Title: Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
for Generative AI
- Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed,
Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera,
Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
- Abstract summary: Unitxt is an innovative library for customizable textual data preparation and evaluation tailored to generative language models.
Unitxt integrates with common libraries like HFace and LM-eval-harness, enabling easy customization and sharing between practitioners.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines.
- Score: 15.220987187105607
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the dynamic landscape of generative NLP, traditional text processing
pipelines limit research flexibility and reproducibility, as they are tailored
to specific dataset, task, and model combinations. The escalating complexity,
involving system prompts, model-specific formats, instructions, and more, calls
for a shift to a structured, modular, and customizable solution. Addressing
this need, we present Unitxt, an innovative library for customizable textual
data preparation and evaluation tailored to generative language models. Unitxt
natively integrates with common libraries like HuggingFace and LM-eval-harness
and deconstructs processing flows into modular components, enabling easy
customization and sharing between practitioners. These components encompass
model-specific formats, task prompts, and many other comprehensive dataset
processing definitions. The Unitxt-Catalog centralizes these components,
fostering collaboration and exploration in modern textual data workflows.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to
build, share, and advance their pipelines collaboratively. Join the Unitxt
community at https://github.com/IBM/unitxt!
Related papers
- From LIMA to DeepLIMA: following a new path of interoperability [2.5764171991553795]
We describe the architecture of the LIMA framework and its recent evolution with the addition of new text analysis modules based on deep neural networks.
Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset.
This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections can be viewed as a new path of interoperability.
arXiv Detail & Related papers (2024-09-10T14:26:12Z) - MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions [8.624104798224085]
Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified.
Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind.
We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations.
arXiv Detail & Related papers (2024-08-26T20:36:52Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)
We propose textbfCost-textbfEfficient textbfLanguage Model textbfAlignment (textbfCELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI)
We employ a system that stands upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv Detail & Related papers (2024-02-07T21:08:49Z) - Interfacing Foundation Models' Embeddings [131.0352288172788]
We present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.
In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval.
arXiv Detail & Related papers (2023-12-12T18:58:02Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z) - A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z) - Text Modular Networks: Learning to Decompose Tasks in the Language of
Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z) - MixingBoard: a Knowledgeable Stylized Integrated Text Generation
Platform [32.50773822686633]
MixingBoard is a platform for building demos with a focus on knowledge grounded stylized text generation.
A user interface for local development, remote access, a webpage API are provided to make it simple for users to build their own demos.
arXiv Detail & Related papers (2020-05-17T20:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.