TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
- URL: http://arxiv.org/abs/2405.11788v1
- Date: Mon, 20 May 2024 05:11:02 GMT
- Title: TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
- Authors: Junlong Jia, Ying Hu, Xi Weng, Yiming Shi, Miao Li, Xingjian Zhang, Baichuan Zhou, Ziyu Liu, Jie Luo, Lei Huang, Ji Wu
- Abstract summary: We present TinyLLaVA Factory, an open-source modular codebase for small-scale large multimodal models (LMMs).
TinyLLaVA Factory modularizes the entire system into interchangeable components, with each component integrating a suite of cutting-edge models and methods.
In addition to allowing users to customize their own LMMs, TinyLLaVA Factory provides popular training recipes to let users pretrain and finetune their models with less coding effort.
- Score: 22.214259364977256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TinyLLaVA Factory, an open-source modular codebase for small-scale large multimodal models (LMMs) with a focus on simplicity of code implementations, extensibility of new features, and reproducibility of training results. Following the design philosophy of the factory pattern in software engineering, TinyLLaVA Factory modularizes the entire system into interchangeable components, with each component integrating a suite of cutting-edge models and methods, meanwhile leaving room for extensions to more features. In addition to allowing users to customize their own LMMs, TinyLLaVA Factory provides popular training recipes to let users pretrain and finetune their models with less coding effort. Empirical experiments validate the effectiveness of our codebase. The goal of TinyLLaVA Factory is to assist researchers and practitioners in exploring the wide landscape of designing and training small-scale LMMs with affordable computational resources.
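The factory-pattern design the abstract describes can be illustrated with a minimal registry sketch. All names below (the registry, `register_vision_tower`, the tower classes) are hypothetical stand-ins, not TinyLLaVA Factory's actual API; the point is only how interchangeable components are registered by name and swapped without touching the rest of the system.

```python
# Hedged sketch of a factory/registry pattern for interchangeable components.
# Names are illustrative, not TinyLLaVA Factory's real API.
from typing import Callable, Dict

VISION_TOWER_REGISTRY: Dict[str, type] = {}

def register_vision_tower(name: str) -> Callable[[type], type]:
    """Class decorator that records a component under a string key."""
    def wrap(cls: type) -> type:
        VISION_TOWER_REGISTRY[name] = cls
        return cls
    return wrap

@register_vision_tower("clip")
class CLIPVisionTower:
    def encode(self, image: str) -> str:
        return f"clip-features({image})"

@register_vision_tower("siglip")
class SigLIPVisionTower:
    def encode(self, image: str) -> str:
        return f"siglip-features({image})"

def build_vision_tower(name: str):
    """Factory entry point: swap components by changing a config string."""
    return VISION_TOWER_REGISTRY[name]()

tower = build_vision_tower("siglip")
print(tower.encode("img.png"))  # → siglip-features(img.png)
```

Swapping the vision encoder then amounts to changing one config string, which is the kind of low-coding-effort customization the abstract claims.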
Related papers
- Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models [30.65606997113044]
Flow-Factory is a framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. It empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support.
arXiv Detail & Related papers (2026-02-13T02:21:59Z) - Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA [53.68989489261506]
Moxin 7B is introduced as a fully open-source Large Language Model (LLM). We develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese. Experiments show that our models achieve superior performance in various evaluations.
arXiv Detail & Related papers (2025-12-22T02:36:42Z) - LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z) - MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage [22.848456481878568]
This paper presents MLKV, an efficient, reusable data storage framework designed to address the scalability challenges in embedding model training.
In experiments on open-source workloads, MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x.
arXiv Detail & Related papers (2025-04-02T08:57:01Z) - HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding [67.24430397016275]
We propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner.
The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
arXiv Detail & Related papers (2025-03-12T06:01:05Z) - MMFactory: A Universal Solution Search Engine for Vision-Language Tasks [35.262080125288115]
We introduce MMFactory, a universal framework that acts like a solution search engine across various available models.
Based on a task description and few sample input-output pairs, MMFactory can suggest a diverse pool of programmatic solutions.
MMFactory also proposes metrics and benchmarks the performance and resource characteristics of each solution, allowing users to pick one that meets their unique design constraints.
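The solution-search loop this summary describes can be sketched as: propose candidate pipelines, score each on the sample input-output pairs plus a resource metric, and return the best candidate that fits the user's constraints. All names here are illustrative, not MMFactory's real API.

```python
# Hedged sketch of a solution search over candidate pipelines.
# `Solution`, `latency_ms`, and the toy candidates are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Solution:
    name: str
    run: Callable[[str], str]
    latency_ms: float  # stand-in resource characteristic

def accuracy(sol: Solution, samples: List[Tuple[str, str]]) -> float:
    """Fraction of sample input-output pairs the solution reproduces."""
    return sum(sol.run(x) == y for x, y in samples) / len(samples)

def search(pool: List[Solution], samples, max_latency_ms: float) -> Solution:
    """Keep only solutions within the resource budget, then rank by accuracy."""
    feasible = [s for s in pool if s.latency_ms <= max_latency_ms]
    return max(feasible, key=lambda s: accuracy(s, samples))

samples = [("2+2", "4"), ("3+5", "8")]
pool = [
    Solution("echo", lambda x: x, latency_ms=1.0),
    Solution("calc", lambda x: str(eval(x)), latency_ms=5.0),
]
best = search(pool, samples, max_latency_ms=10.0)
print(best.name)  # → calc
```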
arXiv Detail & Related papers (2024-12-24T00:59:16Z) - xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs)
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z) - Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z) - TinyLLaVA: A Framework of Small-scale Large Multimodal Models [11.686023770810937]
We study the effects of different vision encoders, connection modules, language models, training data and training recipes.
Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.
arXiv Detail & Related papers (2024-02-22T05:05:30Z) - ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models [51.35570730554632]
ESPnet-SPK is a toolkit for training speaker embedding extractors.
We provide several models, ranging from x-vector to recent SKA-TDNN.
We also aspire to bridge developed models with other domains.
arXiv Detail & Related papers (2024-01-30T18:18:27Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - Model LEGO: Creating Models Like Disassembling and Assembling Building Blocks [53.09649785009528]
In this paper, we explore a paradigm that does not require training to obtain new models.
Similar to how CNNs were inspired by receptive fields in the biological visual system, we propose Model Disassembling and Assembling.
For model assembling, we present the alignment padding strategy and parameter scaling strategy to construct a new model tailored for a specific task.
arXiv Detail & Related papers (2022-03-25T05:27:28Z) - Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
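The fusion step described above can be sketched numerically: the server model is trained on unlabeled data to match the averaged predictive distribution of the client models, rather than averaging their parameters. The linear classifiers, sizes, and learning rate below are toy assumptions, not the paper's actual setup.

```python
# Hedged sketch of ensemble distillation for model fusion: distill the
# averaged client predictions (soft labels) into a single server model.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Three client models (toy linear classifiers with different weights)
# and a batch of unlabeled data available at the server.
clients = [rng.normal(size=(4, 3)) for _ in range(3)]
x_unlabeled = rng.normal(size=(8, 4))

# Teacher signal: the ensemble's averaged predictive distribution.
teacher = np.mean([softmax(x_unlabeled @ w) for w in clients], axis=0)

def ce(w):
    """Cross-entropy between server predictions and the teacher."""
    p = softmax(x_unlabeled @ w)
    return -(teacher * np.log(p + 1e-12)).sum() / len(x_unlabeled)

# Distill: gradient descent on the cross-entropy to the soft labels.
w_server = np.zeros((4, 3))
loss_before = ce(w_server)
for _ in range(500):
    p = softmax(x_unlabeled @ w_server)
    w_server -= 0.1 * x_unlabeled.T @ (p - teacher) / len(x_unlabeled)
loss_after = ce(w_server)

print(loss_after < loss_before)  # the server now tracks the ensemble
```

Unlike parameter averaging, this fusion only needs the clients' output probabilities, so it also works when client architectures differ.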
arXiv Detail & Related papers (2020-06-12T14:49:47Z) - The Collective Knowledge project: making ML models more portable and reproducible with open APIs, reusable best practices and MLOps [0.2538209532048866]
This article provides an overview of the Collective Knowledge technology (CK, or cKnowledge).
CK attempts to make it easier to reproduce ML&systems research, deploy ML models in production, and adapt them to changing data sets, models, research techniques, software, and hardware.
arXiv Detail & Related papers (2020-06-12T13:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.