Related papers: How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model

URL: http://arxiv.org/abs/2311.07594v3
Date: Wed, 08 Jan 2025 02:33:37 GMT
Title: How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model
Authors: Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, Weimin Zhang
Abstract summary: We explore Multimodal Large Language Models (MLLMs), which integrate LLMs to handle multimodal data, including text, images, audio, and more. MLLMs face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs. Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility.
Score: 12.358079352117699
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We explore Multimodal Large Language Models (MLLMs), which integrate LLMs like GPT-4 to handle multimodal data, including text, images, audio, and more. MLLMs demonstrate capabilities such as generating image captions and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs, posing potential risks to society. Selecting the appropriate modality alignment method is crucial, as improper methods might require more parameters without significant performance improvements. This paper aims to explore modality alignment methods for LLMs and their current capabilities. Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility. The study surveys existing modality alignment methods for MLLMs, categorizing them into four groups: (1) Multimodal Converter, which transforms data into a format that LLMs can understand; (2) Multimodal Perceiver, which improves how LLMs perceive different types of data; (3) Tool Learning, which leverages external tools to convert data into a common format, usually text; and (4) Data-Driven Method, which teaches LLMs to understand specific data types within datasets.

Related papers

LLMs can see and hear without any training [63.964888082106974]
MILS is a simple, training-free approach to imbue multimodal capabilities into your favorite LLM. We establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. Being a gradient-free optimization approach, MILS can invert multimodal embeddings into text.
arXiv Detail & Related papers (2025-01-30T02:16:35Z)
FDLLM: A Text Fingerprint Detection Method for LLMs in Multi-Language, Multi-Domain Black-Box Environments [18.755880639770755]
Using large language models (LLMs) can lead to potential security risks. Attackers may exploit this black-box scenario to deploy malicious models and embed viruses in the code provided to users. We propose the first LLMGT fingerprint detection model, FDLLM, based on Qwen2.5-7B and fine-tuned using LoRA to address these challenges.
arXiv Detail & Related papers (2025-01-27T13:18:40Z)
Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset [52.286323454512996]
Large Language Models (LLMs) can comprehend and analyze hybrid text, containing textual and tabular data. We propose an Automated Information Extraction framework (AIE) to enable LLMs to process the hybrid long documents (HLDs) and carry out experiments to analyse four important aspects of information extraction from HLDs. To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset.
arXiv Detail & Related papers (2024-12-28T07:54:14Z)
Towards Robust Evaluation of Unlearning in LLMs via Data Transformations [17.927224387698903]
Large Language Models (LLMs) have shown great success in a wide range of applications, from regular NLP-based use cases to AI agents. Research in the area of Machine Unlearning (MUL) has recently become active. The main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering performance loss on regular tasks.
arXiv Detail & Related papers (2024-11-23T07:20:36Z)
FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [64.50893177169996]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources. We introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios. We develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z)
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large-scale MLLM (l-MLLM) to a small-scale MLLM (s-MLLM). Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM. We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities. We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z)
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach. Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens. We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected. On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data. To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources. Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations. One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks. We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities. We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLMs). While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested. We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot. This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.