Related papers: Aligning Multimodal LLM with Human Preference: A Survey

Aligning Multimodal LLM with Human Preference: A Survey

URL: http://arxiv.org/abs/2503.14504v2
Date: Sun, 23 Mar 2025 15:07:54 GMT
Title: Aligning Multimodal LLM with Human Preference: A Survey
Authors: Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan,
Abstract summary: Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training.<n>Multimodal Large Language Models (MLLMs) have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data.<n>However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed.
Score: 62.89722942008262
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.

Related papers

Efficient Compositional Multi-tasking for On-device Large Language Models [19.179619181605556]
We study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks.<n>Our contributions lay the groundwork for advancing the capabilities of large language models in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
arXiv Detail & Related papers (2025-07-21T21:39:23Z)
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs)<n>We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines.<n>We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation [63.97051732013936]
We propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases.<n>In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes.<n>In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes.
arXiv Detail & Related papers (2025-07-14T14:34:15Z)
Fine-tuning Large Language Model for Automated Algorithm Design [23.04239252690957]
We explore fine-tuning of large language models (LLMs) for algorithm design.<n>Our experiments span three distinct algorithm design tasks.<n>Results suggest that finetuned LLMs can significantly outperform their off-the-shelf counterparts.
arXiv Detail & Related papers (2025-07-13T15:21:23Z)
Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search [15.767411435705752]
We show and analyze the fitness landscape of Large Language Models-assisted Algorithm Search. Our findings reveal that LAS landscapes are highly multimodal and rugged. We also demonstrate how population size influences exploration-exploitation trade-offs and the evolving trajectory of elite algorithms.
arXiv Detail & Related papers (2025-04-28T09:52:41Z)
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing [35.686125031177234]
Multi-Document Summarization (MDS) is a challenging task that focuses on extracting and synthesizing useful information from multiple lengthy documents. We propose a novel framework that leverages inference-time scaling for this task. We also introduce two new evaluation metrics: Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (ACU) score.
arXiv Detail & Related papers (2025-02-27T23:34:47Z)
Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond [35.78910104369677]
Multi-objective optimization (MOO) in deep learning aims to simultaneously optimize multiple conflicting objectives. Advancements in gradient-based MOO methods have enabled the discovery of diverse types of solutions. These developments have broad applications across domains such as reinforcement learning, computer vision, recommendation systems, and large language models.
arXiv Detail & Related papers (2025-01-19T04:56:55Z)
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
Aligning Large Language Models and Geometric Deep Models for Protein Representation [57.59506688299817]
Latent representation alignment is used to map embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) Preliminary protein-focused large language models (MLLMs) have emerged, but they have predominantly relied on approaches lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process.
arXiv Detail & Related papers (2024-11-08T04:15:08Z)
Towards a Unified View of Preference Learning for Large Language Models: A Survey [88.66719962576005]
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. We decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm.
arXiv Detail & Related papers (2024-09-04T15:11:55Z)
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs) Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
Active Representation Learning for General Task Space with Applications in Robotics [44.36398212117328]
We propose an algorithmic framework for textitactive representation learning, where the learner optimally chooses which source tasks to sample from. We provide several instantiations under this framework, from bilinear and feature-based nonlinear to general nonlinear cases. Our algorithms outperform baselines by $20%-70%$ on average.
arXiv Detail & Related papers (2023-06-15T08:27:50Z)
Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint. We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.