HuggingGraph: Understanding the Supply Chain of LLM Ecosystem
- URL: http://arxiv.org/abs/2507.14240v3
- Date: Thu, 04 Sep 2025 23:06:49 GMT
- Title: HuggingGraph: Understanding the Supply Chain of LLM Ecosystem
- Authors: Mohammad Shahedur Rahman, Peng Gao, Yuede Ji,
- Abstract summary: Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words.<n>LLMs can inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets.<n>Motivated by that, this project aims to study such relationships between models and datasets.
- Score: 6.131279654327215
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words, enabling them to perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and content generation. As existing LLMs are often built from base models or other pre-trained models and use external datasets, they can inevitably inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets. Therefore, it is critical to understand these components' origin and development process to detect potential risks, improve model fairness, and ensure compliance with regulatory frameworks. Motivated by that, this project aims to study such relationships between models and datasets, which are the central parts of the LLM supply chain. First, we design a methodology to systematically collect LLMs' supply chain information. Then, we design a new graph to model the relationships between models and datasets, which is a directed heterogeneous graph, having 402,654 nodes and 462,524 edges. Lastly, we perform different types of analysis and make multiple interesting findings.
Related papers
- Large Language Model Sourcing: A Survey [84.63438376832471]
Large language models (LLMs) have revolutionized artificial intelligence, shifting from supporting objective tasks to empowering subjective decision-making.<n>Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become significant.<n>This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions.
arXiv Detail & Related papers (2025-10-11T10:52:30Z) - Accurate and Consistent Graph Model Generation from Text with Large Language Models [1.9049294570026933]
Graph model generation from natural language description is an important task with many applications in software engineering.<n>With the rise of large language models (LLMs), there is a growing interest in using LLMs for graph model generation.<n>We propose a novel abstraction-concretization framework that enhances the consistency and quality of generated graph models.
arXiv Detail & Related papers (2025-08-01T01:52:25Z) - C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning [78.36259648527401]
C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities.<n>We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-07-22T12:27:08Z) - Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks [26.992997870540435]
Graph Omni is a benchmark to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language.<n>We identify critical interactions among graph types, serialization formats, and prompting schemes, demonstrating their substantial impact on model performance.<n>We propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities.
arXiv Detail & Related papers (2025-04-17T09:01:16Z) - A Survey of Model Architectures in Information Retrieval [59.61734783818073]
The period from 2019 to the present has represented one of the biggest paradigm shifts in information retrieval (IR) and natural language processing (NLP)<n>We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs)<n>We conclude with a forward-looking discussion of emerging challenges and future directions.
arXiv Detail & Related papers (2025-02-20T18:42:58Z) - Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks [25.720233631885726]
integration of Graph Neural Networks (GNNs) and Large Language Models (LLMs) has emerged as a promising technological paradigm.<n>We leverage graph description texts with rich semantic context to fundamentally enhance Data quality.<n>This work serves as a foundational reference for researchers and practitioners looking to advance graph learning methodologies.
arXiv Detail & Related papers (2024-12-17T01:41:17Z) - Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud [12.651588927599441]
We present a family of data augmentation models designed to significantly improve the efficiency for model fine-tuning.<n>These models, trained based on sufficiently small LLMs, support key functionalities with low inference costs.<n>Experiments and an application study prove the effectiveness of our approach.
arXiv Detail & Related papers (2024-12-06T09:04:12Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models [36.576853882830896]
We introduce EvolveDirector to train a text-to-image generation model comparable to advanced models using publicly available resources.
This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model.
We leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model.
arXiv Detail & Related papers (2024-10-09T17:52:28Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA)
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z) - Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models [42.17166746027585]
We introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data.
Specifically, we propose a $beta$-VAE based module to extract factors as the initial nodes of the graph.
By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement.
arXiv Detail & Related papers (2024-07-26T15:32:21Z) - From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models [18.219485459836285]
Generative Tabular Learning (GTL) is a novel framework that integrates the advanced functionalities of large language models (LLMs)
Our empirical study spans 384 public datasets, rigorously analyzing GTL's scaling behaviors.
GTL-LLaMA-2 model demonstrates superior zero-shot and in-context learning capabilities across numerous classification and regression tasks.
arXiv Detail & Related papers (2023-10-11T09:37:38Z) - Integrating Graphs with Large Language Models: Methods and Prospects [68.37584693537555]
Large language models (LLMs) have emerged as frontrunners, showcasing unparalleled prowess in diverse applications.
Merging the capabilities of LLMs with graph-structured data has been a topic of keen interest.
This paper bifurcates such integrations into two predominant categories.
arXiv Detail & Related papers (2023-10-09T07:59:34Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP)
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains under explored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.