Multi-aspect Knowledge Distillation with Large Language Model
- URL: http://arxiv.org/abs/2501.13341v3
- Date: Tue, 28 Jan 2025 02:12:59 GMT
- Title: Multi-aspect Knowledge Distillation with Large Language Model
- Authors: Taegyeong Lee, Jinsik Bang, Soyeong Kwon, Taehwan Kim
- Abstract summary: We propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs).
We primarily apply our method to image classification and, to explore its extensibility, expand it to other tasks such as object detection.
- Score: 2.317771311576205
- Abstract: Recent advancements in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Because they focus on classifying images using only class labels, these methods may struggle to learn various "aspects" of classes (e.g., natural positions and shape changes). Rethinking the previous approach from a novel view, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying the MLLM with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting the corresponding logits from the MLLM, and 3) expanding the model's output dimensions to distill these multi-aspect logits. We then apply cross-entropy loss to the class logits and binary cross-entropy loss to the multi-aspect logits. Through our method, the model can learn not only knowledge about visual aspects but also abstract and complex aspects that require a deeper understanding. We primarily apply our method to image classification and, to explore its extensibility, expand it to other tasks such as object detection. In all experimental results, our method improves the performance of the baselines. Additionally, we analyze the effect of multi-aspect knowledge distillation. These results demonstrate that our method can transfer knowledge about various aspects to the model and that this aspect knowledge can enhance model performance in computer vision tasks. This paper demonstrates the great potential of multi-aspect knowledge distillation, and we believe it offers a promising direction for future research in computer vision and beyond.
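To make the training objective concrete, below is a minimal PyTorch sketch of the setup the abstract describes: the classifier's output head is widened to emit both class logits and multi-aspect logits, with cross-entropy applied to the former and binary cross-entropy to the latter. The names (MultiAspectClassifier, num_aspects, aspect_targets) and the weighting term lambda_aspect are illustrative assumptions, not taken from the paper; the aspect targets are assumed to be soft labels in [0, 1] extracted from the MLLM's answers to the multi-aspect questions.

```python
# A minimal sketch of the objective described in the abstract, assuming
# PyTorch. MultiAspectClassifier, num_aspects, aspect_targets, and
# lambda_aspect are illustrative names/assumptions, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAspectClassifier(nn.Module):
    """Classifier whose output head is widened so that, alongside the
    class logits, it emits one extra logit per aspect queried from the MLLM."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_classes: int, num_aspects: int):
        super().__init__()
        self.backbone = backbone
        # Expanded output dimension: class logits followed by aspect logits.
        self.head = nn.Linear(feat_dim, num_classes + num_aspects)
        self.num_classes = num_classes

    def forward(self, x: torch.Tensor):
        logits = self.head(self.backbone(x))
        # Split the widened output back into class and aspect logits.
        return logits[:, :self.num_classes], logits[:, self.num_classes:]

def multi_aspect_kd_loss(class_logits, aspect_logits, class_labels,
                         aspect_targets, lambda_aspect: float = 1.0):
    """Cross-entropy on class logits plus binary cross-entropy on the
    multi-aspect logits distilled from the MLLM."""
    ce = F.cross_entropy(class_logits, class_labels)
    bce = F.binary_cross_entropy_with_logits(aspect_logits, aspect_targets)
    return ce + lambda_aspect * bce
```

One natural design choice in this reading, though not stated in the abstract, is that the aspect logits serve only as an auxiliary distillation target during training and can be ignored at inference, leaving the baseline's prediction interface unchanged.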
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluations on different benchmarks show that, with the proper strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Masked Modeling for Self-supervised Representation Learning on Vision and Beyond [69.64364187449773]
Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and pointing out several potential avenues for advancing masked modeling research.
arXiv Detail & Related papers (2023-12-31T12:03:21Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Heterogeneous Generative Knowledge Distillation with Masked Image Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z)
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
- Prototype-guided Cross-task Knowledge Distillation for Large-scale Models [103.04711721343278]
Cross-task knowledge distillation helps train a small student model to obtain competitive performance.
We propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios.
arXiv Detail & Related papers (2022-12-26T15:00:42Z)
- Explainable Recommender Systems via Resolving Learning Representations [57.24565012731325]
Explanations can help improve user experience and uncover system defects.
We propose a novel explainable recommendation model through improving the transparency of the representation learning process.
arXiv Detail & Related papers (2020-08-21T05:30:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.