Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
for Generative AI
- URL: http://arxiv.org/abs/2401.14019v1
- Date: Thu, 25 Jan 2024 08:57:33 GMT
- Title: Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
for Generative AI
- Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed,
Ofir Arviv, Matan Orbach, Shachar Don-Yehiya, Dafna Sheinwald, Ariel Gera,
Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
- Abstract summary: Unitxt is an innovative library for customizable textual data preparation and evaluation tailored to generative language models.
Unitxt integrates with common libraries like HuggingFace and LM-eval-harness, enabling easy customization and sharing between practitioners.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the dynamic landscape of generative NLP, traditional text processing
pipelines limit research flexibility and reproducibility, as they are tailored
to specific dataset, task, and model combinations. The escalating complexity,
involving system prompts, model-specific formats, instructions, and more, calls
for a shift to a structured, modular, and customizable solution. Addressing
this need, we present Unitxt, an innovative library for customizable textual
data preparation and evaluation tailored to generative language models. Unitxt
natively integrates with common libraries like HuggingFace and LM-eval-harness
and deconstructs processing flows into modular components, enabling easy
customization and sharing between practitioners. These components encompass
model-specific formats, task prompts, and many other comprehensive dataset
processing definitions. The Unitxt-Catalog centralizes these components,
fostering collaboration and exploration in modern textual data workflows.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to
build, share, and advance their pipelines collaboratively. Join the Unitxt
community at https://github.com/IBM/unitxt!
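To make the abstract's central idea concrete, here is a minimal, self-contained sketch of what "deconstructing a processing flow into modular components" can look like: a model-agnostic task prompt and a model-specific format are defined independently, registered under names in a small catalog, and composed into a recipe. All class and catalog names here are illustrative assumptions for exposition only; they are not Unitxt's actual API.

```python
# A sketch of modular prompt-pipeline composition, in the spirit of the
# components described in the abstract. Names are hypothetical, not Unitxt's.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskPrompt:
    """A task-level instruction, independent of any particular model."""
    instruction: str

    def apply(self, example: Dict[str, str]) -> str:
        return f"{self.instruction}\n{example['text']}"


@dataclass
class ModelFormat:
    """Model-specific wrapping: system prompt plus a turn template."""
    system_prompt: str
    turn_template: str  # e.g. "<user>{content}</user>"

    def apply(self, content: str) -> str:
        return f"{self.system_prompt}\n" + self.turn_template.format(content=content)


@dataclass
class Recipe:
    """A pipeline assembled from independently shareable components."""
    task: TaskPrompt
    fmt: ModelFormat

    def process(self, examples: List[Dict[str, str]]) -> List[str]:
        # Task prompt first, then model-specific formatting on top.
        return [self.fmt.apply(self.task.apply(ex)) for ex in examples]


# A tiny stand-in for a shared catalog: named components that practitioners
# could publish, discover, and mix without rewriting the whole pipeline.
CATALOG = {
    "tasks.sentiment": TaskPrompt("Classify the sentiment of the text."),
    "formats.chat": ModelFormat("You are a helpful assistant.",
                                "<user>{content}</user>"),
}

recipe = Recipe(task=CATALOG["tasks.sentiment"], fmt=CATALOG["formats.chat"])
out = recipe.process([{"text": "Great movie!"}])
print(out[0])
```

Swapping the model format while keeping the task prompt (or vice versa) only requires pointing the recipe at a different catalog entry, which is the reuse-and-share property the abstract emphasizes.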
Related papers
- Chunk-Distilled Language Modeling [25.238256586953487]
Chunk-Distilled Language Modeling (CD-LM) is an approach to text generation that addresses two challenges in current large language models (LLMs).
Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step.
arXiv Detail & Related papers (2024-12-31T08:32:15Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- From LIMA to DeepLIMA: following a new path of interoperability [2.5764171991553795]
We describe the architecture of the LIMA framework and its recent evolution with the addition of new text analysis modules based on deep neural networks.
Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset.
This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections can be viewed as a new path of interoperability.
arXiv Detail & Related papers (2024-09-10T14:26:12Z)
- MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions [8.624104798224085]
Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified.
Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind.
We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations.
arXiv Detail & Related papers (2024-08-26T20:36:52Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system built upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv Detail & Related papers (2024-02-07T21:08:49Z)
- Interfacing Foundation Models' Embeddings [131.0352288172788]
We present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.
In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval.
arXiv Detail & Related papers (2023-12-12T18:58:02Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.