Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
  for Generative AI
        - URL: http://arxiv.org/abs/2401.14019v1
 - Date: Thu, 25 Jan 2024 08:57:33 GMT
 - Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed,
  Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera,
  Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
 - Abstract summary: Unitxt is an innovative library for customizable textual data preparation and evaluation tailored to generative language models.
Unitxt integrates with common libraries like HuggingFace and LM-eval-harness, enabling easy customization and sharing between practitioners.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines.
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract:   In the dynamic landscape of generative NLP, traditional text processing
pipelines limit research flexibility and reproducibility, as they are tailored
to specific dataset, task, and model combinations. The escalating complexity,
involving system prompts, model-specific formats, instructions, and more, calls
for a shift to a structured, modular, and customizable solution. Addressing
this need, we present Unitxt, an innovative library for customizable textual
data preparation and evaluation tailored to generative language models. Unitxt
natively integrates with common libraries like HuggingFace and LM-eval-harness
and deconstructs processing flows into modular components, enabling easy
customization and sharing between practitioners. These components encompass
model-specific formats, task prompts, and many other comprehensive dataset
processing definitions. The Unitxt-Catalog centralizes these components,
fostering collaboration and exploration in modern textual data workflows.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to
build, share, and advance their pipelines collaboratively. Join the Unitxt
community at https://github.com/IBM/unitxt!
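
The abstract's core idea, deconstructing a processing flow into modular components (task prompts, model-specific formats, and so on) that can be freely recombined and shared, can be sketched in plain Python. Note that this is an illustrative toy, not the actual Unitxt API: the step names, the sentiment task, and the chat template below are all invented for the example.

```python
# Toy sketch of a modular text-preparation pipeline, in the spirit of the
# abstract. Each step is an independent, shareable component; a pipeline
# is just an ordered composition of steps. NOT the real Unitxt API.
from dataclasses import dataclass
from typing import Callable, List

# A step maps one instance (a dict of fields) to an updated instance.
Step = Callable[[dict], dict]

def task_prompt(instance: dict) -> dict:
    # Task-level component: wraps the raw text in a task instruction.
    instance["source"] = f"Classify the sentiment: {instance['text']}"
    return instance

def model_format(instance: dict) -> dict:
    # Model-specific component: applies a (hypothetical) chat template.
    instance["source"] = "<|user|>\n" + instance["source"] + "\n<|assistant|>\n"
    return instance

@dataclass
class Pipeline:
    steps: List[Step]

    def __call__(self, instance: dict) -> dict:
        # Apply each modular step in order.
        for step in self.steps:
            instance = step(instance)
        return instance

pipeline = Pipeline(steps=[task_prompt, model_format])
result = pipeline({"text": "A delightful read."})
print(result["source"])
```

Because each step is self-contained, swapping the model format or the task prompt is a one-line change to the `steps` list; a shared catalog of such components is what the Unitxt-Catalog provides at scale.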
        Related papers
- SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension [77.93156509994994]
We show how to represent short chunks in a way that is conditioned on a broader context window to enhance retrieval performance.
Existing embedding models are not well-equipped to encode such situated context effectively.
Our method substantially outperforms state-of-the-art embedding models.
arXiv  Detail & Related papers  (2025-08-03T23:59:31Z)
- Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text.
A key contribution of this work is the introduction of a newly annotated dataset.
We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv  Detail & Related papers  (2025-07-11T07:25:55Z)
- RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls.
Existing approaches to synthetic data generation fail to replicate real-world data distributions.
We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv  Detail & Related papers  (2025-05-15T16:53:45Z)
- Adaptive Orchestration of Modular Generative Information Access Systems [59.102816309859584]
We argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system.
This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures.
arXiv  Detail & Related papers  (2025-04-24T11:35:43Z)
- Langformers: Unified NLP Pipelines for Language Models [3.690904966341072]
Langformers is an open-source Python library designed to streamline NLP pipelines.
It integrates conversational AI, pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API.
arXiv  Detail & Related papers  (2025-04-12T10:17:49Z)
- Chunk-Distilled Language Modeling [25.238256586953487]
Chunk-Distilled Language Modeling (CD-LM) is an approach to text generation that addresses two challenges in current large language models (LLMs).
Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step.
arXiv  Detail & Related papers  (2024-12-31T08:32:15Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv  Detail & Related papers  (2024-12-02T18:11:30Z)
- From LIMA to DeepLIMA: following a new path of interoperability [2.5764171991553795]
We describe the architecture of the LIMA framework and its recent evolution with the addition of new text analysis modules based on deep neural networks.
Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset.
This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections can be viewed as a new path of interoperability.
arXiv  Detail & Related papers  (2024-09-10T14:26:12Z)
- MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions [8.624104798224085]
Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified.
Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind.
We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations.
arXiv  Detail & Related papers  (2024-08-26T20:36:52Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv  Detail & Related papers  (2024-05-17T07:43:25Z)
- Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system that stands upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv  Detail & Related papers  (2024-02-07T21:08:49Z)
- Interfacing Foundation Models' Embeddings [131.0352288172788]
We present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.
In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval.
arXiv  Detail & Related papers  (2023-12-12T18:58:02Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv  Detail & Related papers  (2023-05-31T05:24:48Z)
- Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such a non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv  Detail & Related papers  (2022-11-30T16:26:38Z)
- A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv  Detail & Related papers  (2021-03-02T16:19:44Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv  Detail & Related papers  (2020-09-01T23:45:42Z)
- MixingBoard: a Knowledgeable Stylized Integrated Text Generation Platform [32.50773822686633]
MixingBoard is a platform for building demos with a focus on knowledge grounded stylized text generation.
A user interface for local development, remote access, and a webpage API are provided to make it simple for users to build their own demos.
arXiv  Detail & Related papers  (2020-05-17T20:29:27Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.