Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation
  for Generative AI
        - URL: http://arxiv.org/abs/2401.14019v1
 - Date: Thu, 25 Jan 2024 08:57:33 GMT
 - Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed,
  Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera,
  Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
 - Abstract summary: Unitxt is an innovative library for customizable textual data preparation and evaluation tailored to generative language models.
Unitxt integrates with common libraries like HuggingFace and LM-eval-harness, enabling easy customization and sharing between practitioners.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines.
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract:   In the dynamic landscape of generative NLP, traditional text processing
pipelines limit research flexibility and reproducibility, as they are tailored
to specific dataset, task, and model combinations. The escalating complexity,
involving system prompts, model-specific formats, instructions, and more, calls
for a shift to a structured, modular, and customizable solution. Addressing
this need, we present Unitxt, an innovative library for customizable textual
data preparation and evaluation tailored to generative language models. Unitxt
natively integrates with common libraries like HuggingFace and LM-eval-harness
and deconstructs processing flows into modular components, enabling easy
customization and sharing between practitioners. These components encompass
model-specific formats, task prompts, and many other comprehensive dataset
processing definitions. The Unitxt-Catalog centralizes these components,
fostering collaboration and exploration in modern textual data workflows.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to
build, share, and advance their pipelines collaboratively. Join the Unitxt
community at https://github.com/IBM/unitxt!
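
The abstract's core idea, deconstructing a processing flow into modular components (task prompts, model-specific formats, and so on) that can be freely recombined and shared, can be sketched in plain Python. Note that this is an illustrative toy, not the actual Unitxt API: the step names, the sentiment task, and the chat template below are all invented for the example.

```python
# Toy sketch of a modular text-preparation pipeline, in the spirit of the
# abstract. Each step is an independent, shareable component; a pipeline
# is just an ordered composition of steps. NOT the real Unitxt API.
from dataclasses import dataclass
from typing import Callable, List

# A step maps one instance (a dict of fields) to an updated instance.
Step = Callable[[dict], dict]

def task_prompt(instance: dict) -> dict:
    # Task-level component: wraps the raw text in a task instruction.
    instance["source"] = f"Classify the sentiment: {instance['text']}"
    return instance

def model_format(instance: dict) -> dict:
    # Model-specific component: applies a (hypothetical) chat template.
    instance["source"] = "<|user|>\n" + instance["source"] + "\n<|assistant|>\n"
    return instance

@dataclass
class Pipeline:
    steps: List[Step]

    def __call__(self, instance: dict) -> dict:
        # Apply each modular step in order.
        for step in self.steps:
            instance = step(instance)
        return instance

pipeline = Pipeline(steps=[task_prompt, model_format])
result = pipeline({"text": "A delightful read."})
print(result["source"])
```

Because each step is self-contained, swapping the model format or the task prompt is a one-line change to the `steps` list; a shared catalog of such components is what the Unitxt-Catalog provides at scale.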
        Related papers
- SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension [77.93156509994994]
We show how to represent short chunks in a way that is conditioned on a broader context window to enhance retrieval performance.
Existing embedding models are not well-equipped to encode such situated context effectively.
Our method substantially outperforms state-of-the-art embedding models.
arXiv  Detail & Related papers  (2025-08-03T23:59:31Z)
- Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text.
A key contribution of this work is the introduction of a newly annotated dataset.
We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv  Detail & Related papers  (2025-07-11T07:25:55Z)
- RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls.
Existing approaches to synthetic data generation fail to replicate real-world data distributions.
We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv  Detail & Related papers  (2025-05-15T16:53:45Z)
- Adaptive Orchestration of Modular Generative Information Access Systems [59.102816309859584]
We argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system.
This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures.
arXiv  Detail & Related papers  (2025-04-24T11:35:43Z)
- Langformers: Unified NLP Pipelines for Language Models [3.690904966341072]
Langformers is an open-source Python library designed to streamline NLP pipelines.
It integrates conversational AI, pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API.
arXiv  Detail & Related papers  (2025-04-12T10:17:49Z)
- Chunk-Distilled Language Modeling [25.238256586953487]
Chunk-Distilled Language Modeling (CD-LM) is an approach to text generation that addresses two challenges in current large language models (LLMs).
Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step.
arXiv  Detail & Related papers  (2024-12-31T08:32:15Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv  Detail & Related papers  (2024-12-02T18:11:30Z)
- From LIMA to DeepLIMA: following a new path of interoperability [2.5764171991553795]
We describe the architecture of the LIMA framework and its recent evolution with the addition of new text analysis modules based on deep neural networks.
Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset.
This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections can be viewed as a new path of interoperability.
arXiv  Detail & Related papers  (2024-09-10T14:26:12Z)
- MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions [8.624104798224085]
Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified.
Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind.
We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations.
arXiv  Detail & Related papers  (2024-08-26T20:36:52Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv  Detail & Related papers  (2024-05-17T07:43:25Z)
- Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system that stands upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv  Detail & Related papers  (2024-02-07T21:08:49Z)
- Interfacing Foundation Models' Embeddings [131.0352288172788]
We present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.
In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval.
arXiv  Detail & Related papers  (2023-12-12T18:58:02Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv  Detail & Related papers  (2023-05-31T05:24:48Z)
- Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such a non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv  Detail & Related papers  (2022-11-30T16:26:38Z)
- A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv  Detail & Related papers  (2021-03-02T16:19:44Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv  Detail & Related papers  (2020-09-01T23:45:42Z)
- MixingBoard: a Knowledgeable Stylized Integrated Text Generation Platform [32.50773822686633]
MixingBoard is a platform for building demos with a focus on knowledge grounded stylized text generation.
A user interface for local development, remote access, and a webpage API are provided to make it simple for users to build their own demos.
arXiv  Detail & Related papers  (2020-05-17T20:29:27Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.